[04:02:51] Phabricator: Create system to sync GitHub issues with Phabricator - https://phabricator.wikimedia.org/T86991#981141 (bd808) NEW
[06:35:30] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK
[06:45:55] Phabricator: Phabricator feed broken - https://phabricator.wikimedia.org/T86998#981241 (Wesalius) NEW
[06:50:25] Phabricator: Phabricator feed broken - https://phabricator.wikimedia.org/T86998#981249 (Wesalius) It has resolved, maybe problem on my side (browser?).
[08:56:00] (PS1) Hashar: Experimental mediawiki-vagrant-bundle-rspec job [integration/config] - https://gerrit.wikimedia.org/r/185399 (https://phabricator.wikimedia.org/T76627)
[09:07:24] (PS2) Hashar: Experimental mediawiki-vagrant rspec and rubocop jobs [integration/config] - https://gerrit.wikimedia.org/r/185399 (https://phabricator.wikimedia.org/T76627)
[09:16:27] (CR) Hashar: [C: 2] Experimental mediawiki-vagrant rspec and rubocop jobs [integration/config] - https://gerrit.wikimedia.org/r/185399 (https://phabricator.wikimedia.org/T76627) (owner: Hashar)
[09:24:24] (Merged) jenkins-bot: Experimental mediawiki-vagrant rspec and rubocop jobs [integration/config] - https://gerrit.wikimedia.org/r/185399 (https://phabricator.wikimedia.org/T76627) (owner: Hashar)
[09:35:10] (PS7) KartikMistry: WIP: Add generic npm-set-env to fix npm on */deploy repos [integration/config] - https://gerrit.wikimedia.org/r/184609
[09:36:13] hashar: you've already killed jslint jobs, right?
[09:36:31] hashar: any more suggestion for https://gerrit.wikimedia.org/r/#/c/184609 ?
[09:47:05] Engineering-Community, Phabricator: Engineering Community team quarterly review Jan 2015 - https://phabricator.wikimedia.org/T85986#981352 (Qgil) Open>Resolved We had a useful quarterly review yesterday. The slides can be found at https://commons.wikimedia.org/wiki/File:Engineering_Community_Quarterly_R...
[09:49:47] Phabricator: Show percentage of teams migrated to Phabricator for project management - https://phabricator.wikimedia.org/T434#981358 (Qgil) We had Engineering Community quarterly review yesterday. We talked about the percentage of teams migrated and the special attention required to the Trello users. I forgo...
[09:57:38] kart_: we still have a bunch of jslint jobs
[09:57:51] kart_: until developers migrate to a 'npm test' entry point and thus a -npm job
[10:07:13] Phabricator: Phabricator as Wikimedia software project management tool - https://phabricator.wikimedia.org/T824#981403 (Qgil)
[10:07:15] Phabricator: Goal: The majority of WMF developer teams and sprints have moved to Phabricator - https://phabricator.wikimedia.org/T825#981397 (Qgil) Open>Resolved > From all the Wikimedia Foundation teams developing software, more than half of the teams and more than half of the ongoing sprints should ha...
[10:11:53] Phabricator: Phabricator as Wikimedia software project management tool - https://phabricator.wikimedia.org/T824#981412 (Qgil) Open>Resolved Yesterday we had Engineering Community team quarterly review. We agreed that the top priority we set for October-December 2014 had been accomplished. @aklapper is n...
[10:35:04] Beta-Cluster, operations: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#981453 (yuvipanda) Open>Resolved a:yuvipanda They have a sane shell now!
\o/
[10:35:05] Beta-Cluster, operations: mwdeploy user has shell /bin/bash in labs LDAP and /bin/false in production/Puppet - https://phabricator.wikimedia.org/T67591#981456 (yuvipanda)
[10:35:10] (PS1) Hashar: Change gremlink mvn job from package to verify [integration/config] - https://gerrit.wikimedia.org/r/185413
[10:35:52] Beta-Cluster, Release-Engineering: Reduce [LOCAL HACK] changes on Beta Cluster to zero - https://phabricator.wikimedia.org/T76392#981462 (yuvipanda)
[10:35:53] Beta-Cluster, operations: mwdeploy user has shell /bin/bash in labs LDAP and /bin/false in production/Puppet - https://phabricator.wikimedia.org/T67591#981459 (yuvipanda) Open>Resolved a:yuvipanda This and the associated issues (different shell, etc) have been fix. prod and beta are unified on mwdepl...
[10:41:41] PROBLEM - Puppet failure on deployment-sca01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[10:43:55] RECOVERY - Puppet failure on deployment-salt is OK: OK: Less than 1.00% above the threshold [0.0]
[10:44:59] PROBLEM - Puppet failure on deployment-pdf02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[10:45:07] PROBLEM - Puppet failure on deployment-parsoidcache02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[10:45:31] PROBLEM - Puppet failure on deployment-memc03 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[10:45:33] hmm
[10:45:35] what now
[10:45:45] (PS2) Hashar: Change gremlin mvn job from package to verify [integration/config] - https://gerrit.wikimedia.org/r/185413
[10:45:47] PROBLEM - Puppet failure on deployment-eventlogging02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[10:46:46] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[10:46:56] PROBLEM - Puppet failure on deployment-cache-mobile03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[10:46:56] PROBLEM - Puppet failure on deployment-restbase02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[10:47:15] (CR) Hashar: [C: 2] "Job refreshed" [integration/config] - https://gerrit.wikimedia.org/r/185413 (owner: Hashar)
[10:50:04] PROBLEM - Puppet failure on deployment-memc04 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[10:50:57] hmmm
[10:51:00] that’s all strange
[10:51:05] since there seem to be no failures
[10:52:06] YuviPanda: at least on cache-bits01 we had:
[10:52:07] Duplicate declaration: Sudo::Group[ops] is already declared in file /etc/puppet/manifests/role/labs.pp:12; cannot redeclare at /etc/puppet/modules/admin/manifests/group.pp:39
[10:52:39] hmm
[10:52:42] that’s doubly strange
[10:52:48] RECOVERY - Puppet failure on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[10:53:42] RECOVERY - Puppet failure on deployment-rsync01 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:53:54] and if Shinken is now able to ssh to beta instances, you can close https://phabricator.wikimedia.org/T86143 :]
[10:54:21] hashar: it can’t yet, actually. alex is looking into it
[10:54:40] there is the project security rule
[10:54:43] and the ferm::rules
[10:54:47] that is a bit messy :]
[10:55:28] Release-Engineering, Phabricator, Phabricator.org: Answer questions about ongoing maintenance of phabricator customizations/extensions - https://phabricator.wikimedia.org/T78464#981482 (Qgil) Open>Resolved Yesterday we discussed about Wikimedia Phabricator maintenance plans in the context of the Enginee...
[10:56:27] (Merged) jenkins-bot: Change gremlin mvn job from package to verify [integration/config] - https://gerrit.wikimedia.org/r/185413 (owner: Hashar)
[10:56:27] hashar: I checked both
[10:56:39] RECOVERY - Puppet failure on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:56:39] RECOVERY - Puppet failure on deployment-fluoride is OK: OK: Less than 1.00% above the threshold [0.0]
[10:57:27] Phabricator: Phabricator feed broken - https://phabricator.wikimedia.org/T86998#981487 (Qgil) Open>Invalid a:Qgil Looks good to me as well now. Thank you for the fast reporting, and don't hesitate to reopen this task if you the problem happening again.
[10:59:38] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0]
[11:00:00] RECOVERY - Puppet failure on deployment-pdf02 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:00:02] RECOVERY - Puppet failure on deployment-memc04 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:00:23] RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:00:29] RECOVERY - Puppet failure on deployment-memc03 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:01:51] RECOVERY - Puppet failure on deployment-cache-bits01 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:02:02] Continuous-Integration: Migrate jsduck-publish jobs to run in labs via integration-publisher - https://phabricator.wikimedia.org/T86175#981500 (hashar)
[11:02:03] Continuous-Integration: Migrate all jobs depending on Zuul git repos out of production slaves - https://phabricator.wikimedia.org/T86659#981499 (hashar)
[11:02:38] Beta-Cluster, Release-Engineering: Reduce [LOCAL HACK] changes on Beta Cluster to zero - https://phabricator.wikimedia.org/T76392#981502 (yuvipanda) Well, it's down to 1 now.
Just T78076 left
[11:02:52] Beta-Cluster, Release-Engineering: Reduce [LOCAL HACK] changes on Beta Cluster to zero - https://phabricator.wikimedia.org/T76392#981504 (yuvipanda)
[11:02:54] Beta-Cluster, operations: Renumber apache user/group to uid=48 - https://phabricator.wikimedia.org/T78076#835083 (yuvipanda)
[11:03:43] RECOVERY - Puppet failure on deployment-cache-upload02 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:04:59] Continuous-Integration: Migrate all jobs depending on Zuul git repos out of production slaves - https://phabricator.wikimedia.org/T86659#981507 (hashar)
[11:05:31] Beta-Cluster, operations: Renumber apache user/group to uid=48 - https://phabricator.wikimedia.org/T78076#981511 (mark) >>! In T78076#976105, @bd808 wrote: >>>! In T78076#975352, @yuvipanda wrote: >> Why is this needed again? T76086 seems to have fixed T75206. And as @ori said, we should be agnostic about the...
[11:08:21] Wikimedia-Labs-Infrastructure, Continuous-Integration, Labs-Team: OpenStack API account to control `contintcloud` labs project - https://phabricator.wikimedia.org/T86170#981512 (hashar) I have created a first draft of the architecture at https://www.mediawiki.org/wiki/Continuous_integration/Architecture/Isola...
[11:10:10] RECOVERY - Puppet failure on deployment-parsoidcache02 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:11:43] Continuous-Integration: Have unit tests of all wmf deployed extensions pass when installed together, in both PHP-Zend and HHVM (tracking) - https://phabricator.wikimedia.org/T69216#981514 (hashar) The extensions are slowly being added to the shared jobs mediawiki-extensions-hhvm and mediawiki-extensions-zend...
[11:11:54] RECOVERY - Puppet failure on deployment-cache-mobile03 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:13:52] Continuous-Integration: Have unit tests of all wmf deployed extensions pass when installed together, in both PHP-Zend and HHVM (tracking) - https://phabricator.wikimedia.org/T69216#981515 (hashar) Moreover, the hhvm flavor of the job is available on ALL extensions in the experimental pipeline. If one want to...
[11:14:12] Continuous-Integration: Have unit tests of all wmf deployed extensions pass when installed together, in both PHP-Zend and HHVM (tracking) - https://phabricator.wikimedia.org/T69216#981516 (hashar) a:hashar
[11:14:58] Release-Engineering, Continuous-Integration: Jenkins: Implement hhvm based voting jobs for mediawiki and extensions (tracking) - https://phabricator.wikimedia.org/T75521#981519 (hashar)
[11:14:59] Continuous-Integration: Have unit tests of all wmf deployed extensions pass when installed together, in both PHP-Zend and HHVM (tracking) - https://phabricator.wikimedia.org/T69216#723107 (hashar)
[11:15:43] RECOVERY - Puppet failure on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:16:55] RECOVERY - Puppet failure on deployment-restbase02 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:17:11] Release-Engineering, Continuous-Integration: Jenkins: Implement hhvm based voting jobs for mediawiki and extensions (tracking) - https://phabricator.wikimedia.org/T75521#981521 (hashar) The hhvm job is voting for mediawiki/core mediawiki/vendor. Some selected extensions have voting hhvm tests via a job share...
[11:17:59] RECOVERY - Puppet failure on deployment-upload is OK: OK: Less than 1.00% above the threshold [0.0]
[11:18:48] Release-Engineering, operations, Continuous-Integration: Let us customize Zuul metrics reported to statsd - https://phabricator.wikimedia.org/T1369#981523 (hashar) I have no spare cycles to implement the feature in Zuul.
That is straight python, should not be too hard for anyone to realize it.
[11:26:34] MediaWiki-Core-Team, Continuous-Integration, MediaWiki-Configuration: Update jenkins for extension registration changes - https://phabricator.wikimedia.org/T86359#981543 (hashar) The jobs that tests multiple extensions together ( mediawiki-extensions-{hhvm,zend} ), clone a varying list of extension to the wor...
[11:59:43] !log removed ferm from all beta hosts via salt
[11:59:46] Logged the message, Master
[12:23:15] Beta-Cluster, Release-Engineering, Deployment-Systems: beta-scap-eqiad fails due to ssh-add not finding mwdeploy ssh key - https://phabricator.wikimedia.org/T86901#981614 (yuvipanda) Open>Resolved a:yuvipanda Unified into /home/mwdeploy now. Puppet is putting ssh keys there. It is shared across insta...
[12:32:54] <_joe_> !log installing the new HHVM package on mediawiki hosts
[12:32:57] Logged the message, Master
[12:37:35] PROBLEM - SSH on deployment-lucid-salt is CRITICAL: Connection refused
[12:40:35] sigh
[12:40:41] this is going to keep complaining every day
[12:40:41] now
[12:43:45] <_joe_> !log added hhvm.pcre_cache_type = "lru" to beta hhvm config
[12:43:49] Logged the message, Master
[12:47:47] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[12:49:54] <_joe_> I am on it btw
[12:51:22] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[12:53:46] <_joe_> mh this has healed
[13:01:20] RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:12:49] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:09:03] Phabricator: Fix search in Wikimedia Phabricator - https://phabricator.wikimedia.org/T75854#981836 (Aklapper) Search is now entirely broken and has become unusable for me.
https://phabricator.wikimedia.org/maniphest/query/9QXVDeQDxlB3/#R Contains Words: openstreetmap osm labs First result for me is https:/...
[14:10:10] Continuous-Integration: Zuul-cloner failing to acquire lock sometimes ("IOError: Lock for file .git/config did already exist, lock is illegal") - https://phabricator.wikimedia.org/T86734#981837 (hashar)
[14:10:11] Continuous-Integration: [fixed, pending upstream merge] Git clone corruption by mediawiki-extensions-hhvm job on integration-slave1006 - https://phabricator.wikimedia.org/T86730#981838 (hashar)
[15:55:38] I am off, see you on Tuesday
[17:20:10] having an odd issue, trying to debug a maint script thats failing on beta. I can log into deployment-mediawiki* and they have hhvm (for hhvm -m debug usage) but no mwscript. I can log into deployment-bastion and it has mwscript, but no hhvm(so can't debug)
[17:20:18] should i file a ticket to have hhvm installed on deployment-bastion, or something else?
[17:21:03] hrmmm, probably, Reedy do we have hhvm on tin?
[17:21:32] at some point we'll need hhvm on tin so we can do repo authoritative mode, at least (afaict)
[17:21:50] in prod i would use terbium rather than tin, looks like tin doesn't have hhvm
[17:22:09] * ebernhardson still cant spell...some day
[17:22:34] oh right re terbium
[17:23:35] now the choice of making another server to mimic terbium or to pack it on top of -bastion
[17:23:47] the faidon in me says (I think) to make another vm
[17:24:23] ebernhardson: can you file a bug about this, plz?
[17:24:30] greg-g: sure
[17:24:32] ty
[17:27:57] tin and terbium are still on precise
[17:30:29] Release-Engineering: Create a terbium clone for the beta cluster - https://phabricator.wikimedia.org/T87036#982188 (EBernhardson) NEW
[17:30:37] greg-g: ^
[17:31:02] legoktm: eek
[17:31:12] ebernhardson: so, you might not be able to do it in prod either?
[17:31:13] hmm, your right tebrium doesn't have it either(i forgot i've been using either osmium or mw1017)
[17:31:19] ahhhh
[17:31:33] which.. is kind of wrong, but ok :P
[17:31:35] osmium is often a crap shoot though, it doesn't use the packaged version of hhvm its some special compile
[17:31:41] yeah
[17:31:44] ori told me to use mw1017 instead of osmium because osmium was broken :P
[17:31:57] osmium == ori/tim/etc's playground
[17:32:09] don't tell ops that
[17:32:11] :)
[17:32:50] Phabricator: Fix search in Wikimedia Phabricator - https://phabricator.wikimedia.org/T75854#982199 (Chad) I'm pretty sure it's fallout from T75743. I don't think it's working quite as well in practice as it did in testing.
[17:33:47] Release-Engineering: Create a terbium clone for the beta cluster - https://phabricator.wikimedia.org/T87036#982202 (greg) Note: terbium (and tin) are both on Precise, thus don't have hhvm either. Erik has been using mw1017 in prod for this use case. We should upgrade terbium as well.
[17:43:09] wow, I would really like it if "file:^foo" searches worked in gerrit. "secondary index must be enabled for file:^foo"
[17:47:12] chrismcmahon: Does it work in Phabricator?
[17:48:10] Beta-Cluster, Release-Engineering, MediaWiki-extensions-Flow, Collaboration-Team: beta-update-databases-eqiad failing on enwiki - https://phabricator.wikimedia.org/T86934#982241 (EBernhardson) still looking into this, the above fixed one problem but there is still something else happening that is triggered by...
[17:49:53] Release-Engineering, MediaWiki-Developer-Summit-2015, Continuous-Integration: 2015 MediaWiki Developer Summit - State of continuous integration (CI), what we want to do in 2015 - https://phabricator.wikimedia.org/T86752#982247 (BGerstle-WMF) Is this specific to Mediawiki, or CI infrastructure across the org?
[17:50:57] James_F: I hadn't thought to try, but Phab search doesn't seem to be that granular.
[17:54:44] Phabricator: Goal: The majority of WMF developer teams and sprints have moved to Phabricator - https://phabricator.wikimedia.org/T825#982258 (Awjrichards) Congratulations! This is a big accomplishment.
[18:05:29] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: File[/home/mwdeploy/.ssh] is already declared in file /etc/puppet/modules/mediawiki/manifests/users.pp:59; cannot redeclare at /etc/puppet/modules/beta/manifests/scap/master.pp:17 on node i-0000010b.eqiad.wmflabs
[18:05:33] YuviPanda: ^
[18:05:45] That's puppet on deployment-basion
[18:05:49] *bastion
[18:10:39] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[18:20:42] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0]
[18:21:46] * greg-g idly ponders what the "[0.0]" part of those messages mean
[18:22:57] greg-g: the threshold, I think
[18:24:04] Phabricator: Respect programmatic (custom regex) links in Bugzilla comments - https://phabricator.wikimedia.org/T850#982350 (Aklapper) If someone wants to help pushing this forward, looking at the covered cases in the subtasks (and editing the task description) or even coming up with regexes covering these ca...
[18:24:09] greg-g: but I'm not sure what 'the data' is -- probably the return code from the last N puppet runs
[18:30:18] chrismcmahon: Hmm. It does have most of our repos in it now.
[18:40:45] Phabricator: Actions shouldn't be attributed to bzimport - https://phabricator.wikimedia.org/T847#982445 (Aklapper) p:Low>Volunteer? We will not be able to technically create fake accounts so we will always end up with either searching for the author if we the author has created an account here, or with...
[18:45:24] out for a few hours but I'll be back
[18:46:22] VisualEditor, MediaWiki-extensions-Flow, Continuous-Integration: Flow tests fails to run with VisualEditor installed - https://phabricator.wikimedia.org/T86920#982507 (Mattflaschen) p:Triage>Normal
[18:55:16] Release-Engineering, MediaWiki-Developer-Summit-2015, Continuous-Integration: 2015 MediaWiki Developer Summit - State of continuous integration (CI), what we want to do in 2015 - https://phabricator.wikimedia.org/T86752#982567 (GWicke) @hashar, T86372 is a discussion about our general CI / deployment strategy...
[18:56:48] Beta-Cluster: Cannot open VE in Betalabs , throwing error Error loading data from server: 503: parsoidserver-http: HTTP 503 - https://phabricator.wikimedia.org/T86951#982577 (Ryasmeen) and again.
[18:56:57] Beta-Cluster: Cannot open VE in Betalabs , throwing error Error loading data from server: 503: parsoidserver-http: HTTP 503 - https://phabricator.wikimedia.org/T86951#982578 (Ryasmeen) Resolved>Open
[18:58:38] Beta-Cluster, Parsoid-Team: Cannot open VE in Betalabs , throwing error Error loading data from server: 503: parsoidserver-http: HTTP 503 - https://phabricator.wikimedia.org/T86951#982582 (greg) p:Triage>Unbreak!
[18:59:35] Beta-Cluster, Parsoid-Team: Cannot open VE in Betalabs , throwing error Error loading data from server: 503: parsoidserver-http: HTTP 503 - https://phabricator.wikimedia.org/T86951#980108 (greg) >>! In T86951#980477, @Krenair wrote: > Parsoid service on deployment-parsoid05 had stopped for some reason, @Arlol...
[19:26:36] bd808: just for you, I updated the WMF Deployments gcal for the next two weeks already :)
[19:27:00] * greg-g goes to get an early lunch
[19:27:57] all-hands is a great time for unscheduled deploys ;)
[19:28:42] Beta-Cluster, Parsoid-Team: Cannot open VE in Betalabs , throwing error Error loading data from server: 503: parsoidserver-http: HTTP 503 - https://phabricator.wikimedia.org/T86951#982664 (Krenair) I don't think it's the same issue: ```krenair@deployment-parsoid05:~$ service parsoid status parsoid start/runni...
[20:02:22] PROBLEM - Puppet failure on deployment-stream is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[20:10:41] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[20:11:13] .....
[20:12:43] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[20:17:11] (PS1) Dduvall: Releasing major version (in alpha) 1.0.0-alpha.1 [selenium] - https://gerrit.wikimedia.org/r/185503
[20:17:31] (CR) jenkins-bot: [V: -1] Releasing major version (in alpha) 1.0.0-alpha.1 [selenium] - https://gerrit.wikimedia.org/r/185503 (owner: Dduvall)
[20:18:36] I am spacing out now .. but how do I log onto the deploy-parsoid* hosts? what is the bastion to login to first?
[20:18:45] Beta-Cluster: Create a terbium clone for the beta cluster - https://phabricator.wikimedia.org/T87036#982863 (greg)
[20:19:06] Beta-Cluster: Create a terbium clone for the beta cluster - https://phabricator.wikimedia.org/T87036#982868 (greg) p:Triage>Normal
[20:19:47] Krenair, ^
[20:19:53] (PS2) Dduvall: Releasing major version (in alpha) 1.0.0.pre.1 [selenium] - https://gerrit.wikimedia.org/r/185503
[20:20:31] oh rubygems... not quite semver friendly
[20:20:43] subbu: deployment-bastion
[20:20:43] subbu, bastion.wmflabs.org
[20:20:50] yeah, that first ^
[20:20:56] ok.
[20:21:08] then deployment-parsoid05
[20:21:13] (I guess you don't need to use deploy-bastion)
[20:21:28] i had already tried bastion.wmflabs.org
[20:21:34] ah .. 05 .. i was trying 04 . that explains it
[20:21:39] 04 was the old one
[20:21:45] that got broken and rebuilt by Roan IIRC
[20:22:03] got it.
[20:23:40] so yeah somehow we've gone through 4 nodes already and it's still broken on the 5th :p
[20:23:49] :)
[20:23:57] i have look up puppet settings to see where parsoid logs are stored here.
[20:25:56] /srv/deployment/parsoid/deploy/conf/wmf/betalabs.localsettings.js mentions "Direct logs to logstash via bunyan and gelf-stream."
[20:26:12] LOGSTASH_HOSTNAME='deployment-logstash1.eqiad.wmflabs'
[20:26:15] Krenair, nothing I see in /data/project/parsoid/parsoid.log
[20:26:22] logs seem normal.
[20:28:23] Krenair, i can open a page in VE on betalabs ..
[20:28:24] http://en.wikipedia.beta.wmflabs.org/w/index.php?title=0.921072816063511&veaction=edit
[20:28:28] no errors.
[20:28:40] hm.
[20:28:54] it was erroring for me and greg-g earlier
[20:29:09] yeah, the whole "503, wanna try again?" error
[20:29:13] let me go back and look in the logs there and see if i find anything there.
[20:29:13] and, I just got it again
[20:29:18] http://en.wikipedia.beta.wmflabs.org/wiki/0.4113324952512356_Moved?veaction=edit
[20:29:27] Error loading data from server: 503: parsoidserver-http: HTTP 503. Would you like to retry?
[20:29:36] * greg-g bbian, phone call
[20:29:44] yes, i got it on that page.
[20:29:59] * greg-g just hits random article, then edit
[20:31:38] Beta-Cluster, Parsoid-Team: Cannot open VE in Betalabs , throwing error Error loading data from server: 503: parsoidserver-http: HTTP 503 - https://phabricator.wikimedia.org/T86951#982889 (Arlolra) I didn't fix anything last time. I just did `service parsoid status` and showed that it was running. Then Krenai...
[20:33:25] 64 bytes from deployment-parsoid05.eqiad.wmflabs (10.68.16.120): icmp_req=13 ttl=64 time=0.688 ms
[20:33:49] why does OSM think that it's still building?
[20:34:53] Krenair, none of the requests are being logged in /data/project/parsoid/parsoid.log or in logstash. so, wonder where the requests are going to.
[20:38:22] subbu: It's possible that logstash is dropping the udp stream. It's also possible that new firewall rules somewhere are blocking
[20:39:39] subbu: On deployment-logstash01, lookup the udp port in /etc/logstash/conf.d/... and then run `sudo tcpdump udp port $POST` to see if things are making it to the logstash server but not being indexed
[20:39:49] Krenair, can you confirm that VE config looks right to you?
[20:40:00] $POST was meant to be $PORT ;)
[20:40:05] bd808, but, logging is enabled for on-disk logging as well .. and i don't see anything in /data/project/parsoid/parsoid.log
[20:40:14] both disk and logstash
[20:40:17] oh. not my fault then :)
[20:40:56] i don't think things are making past the cache
[20:40:59] https://phabricator.wikimedia.org/T86951#982889
[20:41:14] the ip doesn't appear to be assigned here https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000006d8.eqiad.wmflabs
[20:42:36] ah .. cache would explain why some pages don't 503 and the rest 503.
[20:42:52] and i can confirm that on the page that opened in VE, when i make an edit and try to review changes, i get an error.
[20:43:49] yeah so VE gets to the cache which can serve some stuff it already has, but fails to connect for the rest
[20:43:57] Release-Engineering, MediaWiki-Developer-Summit-2015, Continuous-Integration: 2015 MediaWiki Developer Summit - State of continuous integration (CI), what we want to do in 2015 - https://phabricator.wikimedia.org/T86752#982911 (hashar) >>! In T86752#982247, @BGerstle-WMF wrote: > Is this specific to Mediawiki...
[20:44:42] it's trying to shuttle to 10.68.16.120 which no longer seems to be assigned to deployment-parsoid05
[20:45:14] well I can ping it
[20:45:29] and SSHing to that IP gives a shell @deployment-parsoid05:~$
[20:45:52] that page you found on labswiki has some other issues, such as status (it's not building still)
[20:46:05] or at least it shouldn't be
[20:46:59] there are some puppet failures on deployment-* nodes, but nothing jumps out as obvious to me: http://shinken.wmflabs.org/problems
[20:47:36] need someone with ops skills to debug this .. but requests aren't getting to parsoid after cache failures ..
[20:48:24] YuviPanda: help if you have time ^ :/
[20:49:26] not sure if there were any changes to varnish config.
[20:49:45] Krenair: where did you try ssh'ing from?
[20:49:53] bastion1
[20:49:58] cause
[20:49:58] arlolra@deployment-parsoidcache02:~$ ssh 10.68.16.120
[20:49:58] ssh: connect to host 10.68.16.120 port 22: Connection timed out
[20:50:37] sounds like ferm?
[20:50:44] I don't think I'm able to ssh to parsoidcache02...
[20:51:20] ah, nope, there it is. I was going from the wrong server, oops
[20:54:19] yeah I can't get from parsoidcache02 to parsoid05 via ssh either
[20:56:37] arlolra, although I don't think that's allowed from any other non-bastion instance, is it?
[20:57:59] idk
[20:58:02] should I be able to
[20:58:03] arlolra@deployment-parsoidcache02:/etc$ curl http://10.68.16.120:8000
[20:58:03] curl: (7) Failed to connect to 10.68.16.120 port 8000: Connection timed out
[20:58:23] Beta-Cluster, MediaWiki-Core-Team, operations: Create a terbium clone for the beta cluster - https://phabricator.wikimedia.org/T87036#982926 (hashar) Seems this should go to #operations , #hhvm and #mediawiki-core-team and be rephrased to: "convert work machine (tin, terbium) to Trusty and hhvm usage" + me...
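Editor's sketch, not part of the channel log: the ssh and curl attempts above both ended in "Connection timed out" while ping still worked, which is what led to the firewall theory. The same reachability check can be approximated in Python with only the standard library. The function name tcp_port_reachable is hypothetical, and the host and port mentioned afterwards are simply the ones quoted in the log.

```python
import socket


def tcp_port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds.

    A refused connection and a timed-out connection both yield False;
    this helper deliberately does not tell them apart.
    """
    try:
        # create_connection resolves the host and applies the timeout to the connect call
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refusals, timeouts, unreachable networks and name-resolution failures
        return False
```

Run from deployment-parsoidcache02, a call such as tcp_port_reachable("10.68.16.120", 8000) returning False would match the curl result above. A timeout rather than an immediate refusal usually suggests that packets are being dropped somewhere on the path, though that is an inference and not something this helper can confirm.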
[20:59:39] arlolra: sounds like a firewall failure
[20:59:47] well, over zealousness
[21:01:31] I wonder if https://wikitech.wikimedia.org/wiki/Special:NovaSecurityGroup is related
[21:01:53] (I can't see what that's like for deployment-prep, just a project member not admin)
[21:02:08] bd808: sorry.. but, can you point arlolra where to look for that ferm rule change?
[21:03:11] * subbu --> coffee shop and back online in 15 mins
[21:04:19] * bd808 saw Yuvi change that in email
[21:05:37] arlolra, greg-g: YuviPanda broke it possibly -- https://gerrit.wikimedia.org/r/#/c/185428/2
[21:06:54] ty
[21:07:19] ah
[21:12:40] bd808 probably called it way back with:
[21:12:40] 15:38 < bd808> subbu: It's possible that logstash is dropping the udp stream. It's also possible that new firewall rules somewhere are blocking
[21:15:53] the commit msg says "That means that some of the ferm rules are no longer required!" ... but then only removed the parsoid one, which seems to still be required(?). should I submit a revert or is there some other way it should be fixed?
[21:16:55] arlolra: comment on it describing the problem you're seeing? cc the people who reviewed/merged it onto the new bug? all of the above probably
[21:17:22] will do. thanks for the hlep
[21:17:25] help too
[21:19:55] Beta-Cluster, Parsoid-Team: Cannot open VE in Betalabs , throwing error Error loading data from server: 503: parsoidserver-http: HTTP 503 - https://phabricator.wikimedia.org/T86951#982978 (Arlolra) Looks like this was caused by https://gerrit.wikimedia.org/r/#/c/185428/
[21:21:21] Beta-Cluster, Parsoid-Team: Cannot open VE in Betalabs , throwing error Error loading data from server: 503: parsoidserver-http: HTTP 503 - https://phabricator.wikimedia.org/T86951#982981 (hashar) CAN WE PLEASE FILL NEW BUGS????? Thx :) Heading to http://en.wikipedia.beta.wmflabs.org/wiki/1stdecemberchrome?...
[21:24:07] Beta-Cluster, MediaWiki-Core-Team, operations: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#982985 (greg)
[21:24:48] Phabricator, Collaboration-Team: Trello migration script - https://phabricator.wikimedia.org/T821#982989 (Spage)
[21:25:16] Beta-Cluster, Parsoid-Team: Cannot open VE in Betalabs , throwing error Error loading data from server: 503: parsoidserver-http: HTTP 503 - https://phabricator.wikimedia.org/T86951#982990 (hashar) @akosiaris and @yuvipanda did the ferm cleanup earlier today, though it is unlikely they are still around at that...
[21:29:57] Beta-Cluster, Parsoid-Team: Monitor the Parsoid backend service on beta cluster - https://phabricator.wikimedia.org/T87063#982997 (hashar) NEW
[21:30:20] Beta-Cluster, Parsoid-Team: Cannot open VE in Betalabs , throwing error Error loading data from server: 503: parsoidserver-http: HTTP 503 - https://phabricator.wikimedia.org/T86951#983009 (hashar) Request to add monitoring of the Parsoid service on the beta cluster is T87063
[21:32:59] Beta-Cluster, MediaWiki-Core-Team, operations: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#983029 (hashar)
[21:33:03] Phabricator, Zero, Collaboration-Team: Trello migration script - https://phabricator.wikimedia.org/T821#983030 (Spage)
[21:34:21] Beta-Cluster, MediaWiki-Core-Team, operations: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#982188 (hashar)
[21:34:55] Beta-Cluster, MediaWiki-Core-Team, operations: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#982188 (hashar) Thanks Greg. I have added some steps to the task description.
I could not find a project/Task related to the Trusty migration :-/ [21:49:15] (CR) Hashar: "check experimental" [selenium] - https://gerrit.wikimedia.org/r/185503 (owner: Dduvall) [21:52:19] (PS1) Hashar: mediawiki/selenium to support rspec [integration/config] - https://gerrit.wikimedia.org/r/185550 [21:53:42] (CR) Hashar: "Awesome! I would like the Jenkins rspec job to be triggered on patch proposal and +2, I have proposed https://gerrit.wikimedia.org/r/#/c/" [selenium] - https://gerrit.wikimedia.org/r/185503 (owner: Dduvall) [21:57:53] .... [21:58:08] where is the bot that ban me from this channel on friday past 10pm ? [21:58:30] just came around to wish you all a good week-end. Will be in SF office on Tuesday morning. [21:59:01] hashar: we will have beers! [21:59:16] maybe not tuesday morning... but sitll [21:59:18] for sure [21:59:39] and thanks for your last mail on the releng list :] It is lovely [21:59:55] that makes me quite happy to work! [22:00:12] hashar: :) [22:00:32] hashar: thanks! ^ [22:00:51] g'night hashar ! [22:03:27] and please gently slap robla for me :-] [22:03:46] I blame him to have started a huggeeee thread I will feel committed to read over the week-end *grin* [22:03:49] rest well! [22:05:14] Beta-Cluster, MediaWiki-Core-Team, operations: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#983124 (EBernhardson) [22:05:59] Beta-Cluster, MediaWiki-Core-Team, operations: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#982188 (EBernhardson) updated description again, to clarify that the scripts don't have any dependency on hhvm, it is being used for its gdb like debug consol...
[23:26:23] !log cherry-picked https://gerrit.wikimedia.org/r/#/c/185570/ to fix puppet errors on deployment-prep [23:26:28] Logged the message, Master [23:26:41] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [23:38:24] Continuous-Integration: Make sure everyone in RelEng has sudo on integration slaves in labs - https://phabricator.wikimedia.org/T86779#983318 (greg) a:greg>None unlicking for now, was over ambitious the week before all hands.. [23:41:40] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0] [23:44:29] (CR) Dduvall: [C: 1] "Yes, please!" [integration/config] - https://gerrit.wikimedia.org/r/185550 (owner: Hashar) [23:46:30] PROBLEM - Free space - all mounts on deployment-cache-upload02 is CRITICAL: CRITICAL: deployment-prep.deployment-cache-upload02.diskspace._srv_vdb.byte_percentfree.value (<100.00%) [23:47:50] bd808: bah! i've tried to download your mw-v iso a few times but it keeps crapping out at some point [23:56:02] bd808: "curl: (18) transfer closed with 1490582208 bytes remaining to read" [23:56:19] boo [23:56:28] bd808: if you has a moment to give me access to the instance and file maybe i can just scp it [23:56:37] or i can wait until tuesday :) [23:56:41] yeah... hang on