[00:07:02] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL
[00:14:43] PROBLEM - Parsoid on deployment-parsoid09 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:00:26] (03PS13) 10Paladox: Support extension and skin dependacies in the skin pipeline and extension pipeline [integration/config] - 10https://gerrit.wikimedia.org/r/323540 (https://phabricator.wikimedia.org/T151593)
[01:00:43] (03PS5) 10Paladox: Add a new Skin dependacies test [integration/config] - 10https://gerrit.wikimedia.org/r/323546 (https://phabricator.wikimedia.org/T151593)
[01:09:48] end of an era. I just removed "scap" from my irc ping list
[01:11:12] has saved the flying pig scap sticker
[01:11:20] from the old laptop
[01:11:34] heh. I still have a stack of them. I'll get you a new one at all-hands
[01:11:39] :)
[01:11:54] and an "I broke Wikipedia" sticker
[01:12:05] nice
[02:34:49] * twentyafterfour still hasn't broken production
[02:34:57] other than phabricator ;)
[04:25:56] 03Scap3, 10Parsoid: Canary doesn't rollback if you don't continue - https://phabricator.wikimedia.org/T149008#2840810 (10dduvall)
[04:25:57] 03Scap3, 10Parsoid, 06Services, 15User-mobrovac: Allow failures for a percentage of targets - https://phabricator.wikimedia.org/T145512#2840811 (10dduvall)
[04:26:00] 03Scap3, 10Parsoid: Rollback failed when target is down - https://phabricator.wikimedia.org/T145460#2840812 (10dduvall)
[04:38:40] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07Regression: doc.wikimedia.org displays "403 Forbidden" for coverage sub directories - https://phabricator.wikimedia.org/T150727#2840834 (10Krinkle) Bump. Links still broken.
[04:42:59] 10MediaWiki-Codesniffer, 13Patch-For-Review: Add sniff for cast operator spacing - https://phabricator.wikimedia.org/T149544#2840836 (10Legoktm) 05Open>03Resolved
[04:44:15] 10MediaWiki-Codesniffer, 13Patch-For-Review: Add SpaceBeforeControlStructureBraceSniff to enforce single space between closing parenthesis and opening brace - https://phabricator.wikimedia.org/T130004#2840839 (10Legoktm) 05Open>03Resolved a:03Legoktm
[04:44:28] 10MediaWiki-Codesniffer, 13Patch-For-Review: Add SpaceBeforeControlStructureBraceSniff to enforce single space between closing parenthesis and opening brace - https://phabricator.wikimedia.org/T130004#2122233 (10Legoktm) a:05Legoktm>03Lethexie
[06:31:17] (03PS3) 10Samwilson: Fix test result parsing, and correct new errors that were exposed [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/324376 (https://phabricator.wikimedia.org/T146439)
[08:28:52] 10MediaWiki-Codesniffer, 13Patch-For-Review: phplint doesn't run on .inc files - https://phabricator.wikimedia.org/T116524#1751533 (10Samwilson) Can this be closed?
[08:49:31] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2840996 (10Joe) So, this happened again this morning, and we have good and bad news: - Good news is the system, with a larger number of jemalloc arenas, too...
[09:07:35] (03PS1) 10Hashar: fundraising/stats add experimental tox job [integration/config] - 10https://gerrit.wikimedia.org/r/324863
[09:09:49] (03CR) 10Hashar: [C: 032] fundraising/stats add experimental tox job [integration/config] - 10https://gerrit.wikimedia.org/r/324863 (owner: 10Hashar)
[09:10:39] (03Merged) 10jenkins-bot: fundraising/stats add experimental tox job [integration/config] - 10https://gerrit.wikimedia.org/r/324863 (owner: 10Hashar)
[10:27:46] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841144 (10Joe) Looking at `api.log`, I found that requests as follows: - to euwiki - `action=parsoid-batch` - `batch-action=preprocess` have had an absurd...
[10:49:10] 10Beta-Cluster-Infrastructure, 07Puppet: deployment-apertium01 puppet failing due to missing packages on trusty - https://phabricator.wikimedia.org/T147210#2841175 (10hashar) Apparently it is gone for real. deployment-apertium01 hasn't reappeared and does not show up in the Horizon interface.
[10:51:36] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841177 (10Joe) {F4939497} shows the rate of such requests
[11:23:14] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841250 (10Joe) There was an edit of Modulu:Wikidata this morning about 20 minutes before the peak of requests from parsoid happened. Not sure if that's rela...
[11:32:33] (03CR) 10Hashar: "The list of dependencies are passed as $EXT_DEPENDENCIES it is then dumped in deps.txt for zuul-cloner. That part will work." [integration/config] - 10https://gerrit.wikimedia.org/r/323540 (https://phabricator.wikimedia.org/T151593) (owner: 10Paladox)
[11:33:03] (03CR) 10Hashar: "That is mostly a copy paste from the ext_dependencies. I would rather have the few tests incorporated in the parent change https://gerrit" [integration/config] - 10https://gerrit.wikimedia.org/r/323546 (https://phabricator.wikimedia.org/T151593) (owner: 10Paladox)
[12:19:34] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841394 (10Joe) This burst in traffic is, looking at parsoid logs, due to `reqId: 3ff62f51-cd11-4b44-98e4-6a6aa608b600` from ChangePropagation. I am unsure...
[12:34:15] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2825098 (10mobrovac) >>! In T151702#2841250, @Joe wrote: > There was an edit of Modulu:Wikidata this morning about 20 minutes before the peak of requests fro...
[13:13:44] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841477 (10Reedy) I've indefinitely protected (to sysop) https://eu.wikipedia.org/wiki/Modulu:Wikidata for now, and left a message at https://eu.wikipedia.or...
[14:12:22] (03PS3) 10Tobias Gritschacher: Change reciepients for Wikibase browsertests [integration/config] - 10https://gerrit.wikimedia.org/r/324698 (https://phabricator.wikimedia.org/T150856)
[14:31:45] (03PS1) 10Hashar: Drop mediawiki-extensions jobs that ran on permanent slaves [integration/config] - 10https://gerrit.wikimedia.org/r/324913 (https://phabricator.wikimedia.org/T135001)
[14:34:51] (03CR) 10Hashar: "Jobs deleted in Jenkins" [integration/config] - 10https://gerrit.wikimedia.org/r/324913 (https://phabricator.wikimedia.org/T135001) (owner: 10Hashar)
[14:34:54] (03CR) 10Hashar: [C: 032] Drop mediawiki-extensions jobs that ran on permanent slaves [integration/config] - 10https://gerrit.wikimedia.org/r/324913 (https://phabricator.wikimedia.org/T135001) (owner: 10Hashar)
[14:35:39] Tobi_WMDE_SW: Guten Tag. Should I push that change of email notifications?
[14:35:51] (03Merged) 10jenkins-bot: Drop mediawiki-extensions jobs that ran on permanent slaves [integration/config] - 10https://gerrit.wikimedia.org/r/324913 (https://phabricator.wikimedia.org/T135001) (owner: 10Hashar)
[14:35:56] hallo hashar!
[14:36:08] yeah, that would be great!
[14:36:44] doing!
[14:36:57] Tobi_WMDE_SW: we should just give you +2 on that repo
[14:37:04] and let you do the update :}
[14:37:56] (03PS4) 10Hashar: Change recipients for Wikibase browsertests [integration/config] - 10https://gerrit.wikimedia.org/r/324698 (https://phabricator.wikimedia.org/T150856) (owner: 10Tobias Gritschacher)
[14:38:11] hashar: ha, so I need to install jjb again finally.. :p
[14:38:23] (03CR) 10Hashar: [C: 032] "Rebased/fixed a trivial typo in the commit message" [integration/config] - 10https://gerrit.wikimedia.org/r/324698 (https://phabricator.wikimedia.org/T150856) (owner: 10Tobias Gritschacher)
[14:38:34] +1 for the +1 :)
[14:38:40] erm.. for the +2
[14:38:43] Tobi_WMDE_SW: I think members of the wmde ldap group get admin access on Jenkins
[14:38:53] so you probably have write access / the ability to tweak jobs already
[14:39:07] hashar: yes, but not the +2 on that repo obviously
[14:39:15] that can be solved :D
[14:39:16] (03Merged) 10jenkins-bot: Change recipients for Wikibase browsertests [integration/config] - 10https://gerrit.wikimedia.org/r/324698 (https://phabricator.wikimedia.org/T150856) (owner: 10Tobias Gritschacher)
[14:40:30] !log added Tobias Gritschacher to Gerrit "integration" group so he can +2 patches on integration/* repositories \O/
[14:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[14:40:36] done :D
[14:40:46] though you would not be able to deploy Zuul layout changes
[14:40:50] that needs shell access on the server
[14:41:45] Tobi_WMDE_SW: are you subscribed to the QA mailing list ?
[14:42:03] hashar: yes, I am!
[14:42:10] since ever.. :)
[14:43:32] I have posted the announcement on the QA list
[14:43:47] hashar: ok, thx!
[14:43:59] https://www.mediawiki.org/wiki/CI/JJB should get you up to par
[14:44:16] should be straightforward
[14:46:47] hashar: ya, I got it running some time ago already
[14:58:03] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841655 (10Joe) I removed the bandaid right now, hoping we didn't miss the origin of the issue. I would still like the concurrency limit of Change Propagati...
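Stepping back to the jjb exchange above: this is roughly the loop the CI/JJB page hashar links describes. A minimal sketch, assuming a local checkout of integration/config with its jjb/ directory; the job-name glob is illustrative, not a command taken from the log:

```
# Install Jenkins Job Builder into the home directory, no root needed.
pip install --user jenkins-job-builder

# From an integration/config checkout, render jobs locally to inspect the
# generated XML; this touches nothing on the Jenkins server.
jenkins-jobs test jjb/ -o /tmp/jjb-output 'browsertests-Wikibase*'

# Once the output looks right, push the definitions to Jenkins
# (credentials are read from a local jenkins_jobs.ini).
jenkins-jobs update jjb/ 'browsertests-Wikibase*'
```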
[15:05:33] PROBLEM - Host deployment-elastic08 is DOWN: CRITICAL - Host Unreachable (10.68.21.29)
[15:16:46] (03CR) 10Hashar: "Eek sorry I havent noticed your change this morning and did it with https://gerrit.wikimedia.org/r/#/c/324863/" [integration/config] - 10https://gerrit.wikimedia.org/r/324779 (owner: 10Awight)
[15:18:12] (03PS2) 10Hashar: Add noop to gate-and-submit for fundraising/stats [integration/config] - 10https://gerrit.wikimedia.org/r/324779 (owner: 10Awight)
[15:18:28] (03CR) 10Hashar: [C: 032] "Rebased/fixed conflict." [integration/config] - 10https://gerrit.wikimedia.org/r/324779 (owner: 10Awight)
[15:19:13] (03Merged) 10jenkins-bot: Add noop to gate-and-submit for fundraising/stats [integration/config] - 10https://gerrit.wikimedia.org/r/324779 (owner: 10Awight)
[15:25:17] (03PS1) 10Hashar: test: drop unused method parameter [integration/config] - 10https://gerrit.wikimedia.org/r/324920
[15:25:34] (03CR) 10Hashar: rake: tweak files filter (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/308150 (https://phabricator.wikimedia.org/T144325) (owner: 10Hashar)
[15:26:05] (03CR) 10jenkins-bot: [V: 04-1] test: drop unused method parameter [integration/config] - 10https://gerrit.wikimedia.org/r/324920 (owner: 10Hashar)
[15:27:29] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841758 (10GWicke) Traditionally, a big issue causing work amplification has been a lack of reliable request timeout support in the MediaWiki API, which is t...
[15:27:55] (03PS2) 10Hashar: test: drop unused method parameter [integration/config] - 10https://gerrit.wikimedia.org/r/324920
[15:28:45] (03CR) 10jenkins-bot: [V: 04-1] test: drop unused method parameter [integration/config] - 10https://gerrit.wikimedia.org/r/324920 (owner: 10Hashar)
[15:29:55] (03PS3) 10Hashar: test: drop unused method parameter [integration/config] - 10https://gerrit.wikimedia.org/r/324920
[15:30:03] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841772 (10Joe) >>! In T151702#2841758, @GWicke wrote: > Traditionally, a big issue causing work amplification has been a lack of reliable request timeout su...
[15:31:31] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841774 (10GWicke) > Except in this specific case changeprop/restbase fire out 23K requests for a specific transclusion in the span of less than one minute...
[15:32:54] (03CR) 10Hashar: [C: 032] test: drop unused method parameter [integration/config] - 10https://gerrit.wikimedia.org/r/324920 (owner: 10Hashar)
[15:33:45] (03Merged) 10jenkins-bot: test: drop unused method parameter [integration/config] - 10https://gerrit.wikimedia.org/r/324920 (owner: 10Hashar)
[15:38:06] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841784 (10ssastry) >>! In T151702#2841774, @GWicke wrote: >> Except in this specific case changeprop/restbase fire out 23K requests for a specific transc...
[15:39:20] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841785 (10Joe) Well, from the MediaWiki perspective, those requests come from parsoid. From the parsoid perspective, they come from ChangePropagation via r...
[15:48:53] twentyafterfour hi, i don't know if this is for git on the server, or server and client, but git 2.11 is broken on phabricator
[15:48:53] Joar, ich hab auch gerade erst gefruehstueckt. ("Yeah, I've also only just had breakfast.")
[15:48:55] Woops
[15:49:02] https://secure.phabricator.com/T11940
[15:49:12] I also upgraded to it yesterday
[15:55:40] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841818 (10GWicke) > So, he is saying the originator of the high concurrency rate is CP which is why I added my comment earlier about spreading out CP's requ...
[15:55:59] paladox: lol @ Fruehstueck. So let us know about the upstream patch related to "malformed IPv6" with clustering
[15:56:10] https://secure.phabricator.com/D16973
[15:56:24] ^^ we should backport that and we could try it on a test instance
[15:56:41] but then it doesn't have ipv6 on labs
[15:56:59] ah, interesting. thanks
[15:57:19] that does seem like it could totally be the error from yesterday
[15:57:40] Yep
[15:57:54] I believe twentyafterfour reported it to them, not sure though.
[15:57:55] ehm.. sigh, https://secure.phabricator.com/T11939
[15:58:03] "IPv6 support" in general.. is .. open
[15:58:05] not resolved
[15:58:27] Yep, but the patch hasn't been merged
[15:58:31] maybe it means we have to remove IPv6 from phab2001 ?
[15:58:34] so we don't know what has been done and what's left
[15:58:37] but that would be a little sad
[15:58:44] No, we don't have to use ipv6 on those hosts
[15:58:47] we can use ipv4
[15:58:48] doesn't matter for the users though.. so..
[15:59:03] yep, guess we should use ipv4 only for now
[15:59:21] yes, i am just trying to introduce v6 where it works, but if we have no support here yet.. then we don't
[15:59:35] and can remove it
[15:59:42] Ok
[16:01:35] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841822 (10ssastry) >>! In T151702#2841818, @GWicke wrote: >> So, he is saying the originator of the high concurrency rate is CP which is why I added my comm...
[16:05:09] mutante i believe the patch twentyafterfour uploaded should prevent the clustering from trying ipv6 https://gerrit.wikimedia.org/r/#/c/324851/1
[16:08:56] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841841 (10GWicke)
[16:10:01] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841846 (10Paladox) Users are reporting problems with watchlist in #wikipedia-en
[16:11:22] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841850 (10Paladox) Users have also reported it on-wiki see https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Several_technical_problems please.
[16:14:04] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841851 (10GWicke) To illustrate using [the RESTBase dashboard for the outage time frame](https://grafana.wikimedia.org/dashboard/db/restbase?from=1480204640...
[16:15:25] paladox: i don't think so, he said that was the exact change he made last night on phab2001 but then the issue happened
[16:16:56] Oh
[16:17:00] paladox: he did not configure anything IPv6, it's just that we have it working and IPv6 is preferred over v4 .. and since upstream doesn't support it fully yet then stuff happens
[16:17:08] afaict
[16:17:09] Oh
[16:17:26] So do we remove ipv6 from those hosts? Do we remove it until ipv6 is fixed upstream?
[16:17:32] that's the option i see
[16:17:37] Ok
[16:18:25] that means a DNS change to remove the AAAA record, a puppet change to remove the interface config and manually removing it as root
[16:18:41] Oh
[16:19:14] If you're ok with doing that and twentyafterfour is ok with that, then we can do that please :)
[16:19:31] as opposed to gerrit it is behind varnish anyways
[16:19:41] not sure about the git-ssh thing though
[16:19:44] Oh
[16:19:58] git-ssh can use ipv4 and ipv6 i think
[16:20:06] let me check my ssh log
[16:20:14] should have it in known hosts
[16:20:21] unless it breaks when you activate clustering..
[16:20:53] mutante we can test that on labs now
[16:20:57] since we want only ipv4
[16:21:05] that is true, yea
[16:21:09] so if it breaks with ipv6 then we know we have to change something :)
[16:21:24] mutante git-ssh seems to use ipv4
[16:21:25] git-ssh.wikimedia.org,208.80.154.250
[16:21:50] it's either v4 or v4+v6, it's never only v6
[16:22:09] Oh
[16:22:18] [phab2001:~] $ host git-ssh.wikimedia.org
[16:22:18] git-ssh.wikimedia.org has address 208.80.154.250
[16:22:18] git-ssh.wikimedia.org has IPv6 address 2620:0:861:ed1a::3:16
[16:22:27] both
[16:22:29] Oh
[16:23:15] or maybe we could go back to the rsync method, dunno
[16:23:25] since we only want the clustering to get the repos synced
[16:23:33] and not the other parts of it yet
[16:23:41] Ok
[16:23:49] needs more discussion i think
[16:24:22] before we start editing DNS
[16:24:48] ok
[16:25:07] it is true that we could probably test the repo sync in labs with 2 instances
[16:25:28] but you'd have to create test repos to sync i guess..
[16:25:47] currently you don't have any in /srv/repos right
[16:25:51] Ok, i mean we can test on one instance
[16:26:08] give it an ip to use and see if it fails with ipv6
[16:26:10] well, you'd need 2 to really test sync
[16:26:13] 06Release-Engineering-Team, 06Operations, 06Parsing-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841877 (10matmarex) @paladox That is definitely a separate issue, not related.
[16:26:13] even though we gave it ipv4
[16:26:30] Oh
[16:26:37] we have phab-05
[16:26:39] the effect here is that if you add the IPs the 2 servers start talking to each other
[16:26:39] and phab-01
[16:27:16] Yep
[16:27:18] phabricator-01 and phabricator-02
[16:27:28] Yep, we can do phab-01 to phab-05
[16:27:32] since now you know it's just applying the role and installing mariadb
[16:27:38] Oh yep
[16:27:48] I only have the role applied on phabricator
[16:28:01] ideally the 2 test instances both have the role
[16:28:10] but maybe one that did not use it before can now start to use it
[16:28:14] since it is now working
[16:28:19] oh
[16:28:21] yeh
[16:37:53] (03CR) 10Hashar: "I have dig into it and it is not so trivial to do it cleanly. Since MediaWiki install.php detects skins and add a wfLoadSkin() automatical" [integration/config] - 10https://gerrit.wikimedia.org/r/323540 (https://phabricator.wikimedia.org/T151593) (owner: 10Paladox)
[16:42:04] mutante i enabled it on phab01
[16:42:06] phab-01
[16:42:34] and it seems to complain about the address if i do /32; if i do /16 it works (so removing ipv6 does work
[16:42:35] )
[16:43:05] "cluster.addresses": [
[16:43:05] "10.68.16.15/16"
[16:43:05] ]
[16:50:28] mutante i believe because we are using phabricator.wikimedia.org on both phab2001 and iridium they could collide?
[16:53:53] twentyafterfour ^^
[17:02:54] paladox: wait, so /32 means "just this one IP"
[17:03:02] and /16 means an entire network
[17:03:05] Oh
[17:04:30] i don't know the details of phab cluster.addresses yet, but testing is good
[17:04:46] about the second part.. where exactly are we using the colliding name?
[17:07:37] mutante phab2001 and iridium currently have the domain phabricator.wikimedia.org
[17:07:52] won't phab2001 and iridium collide?
[17:07:57] since they are both prod
[17:08:14] 10Continuous-Integration-Config, 10puppet-compiler: Migrate Jenkins job "operations-puppet-catalog-compiler" to Jenkins Job Builder - https://phabricator.wikimedia.org/T97513#2841990 (10hashar)
[17:08:16] phabricator.wikimedia.org has address 198.35.26.120
[17:08:28] 120.26.35.198.in-addr.arpa domain name pointer misc-web-lb.ulsfo.wikimedia.org.
[17:08:31] ^ they don't have it
[17:08:36] the misc-web varnish does
[17:08:56] Yep, oh
[17:09:08] and in varnish config it says "if the request is for phabricator.wm.org , send to iridium:"
[17:09:32] and what we want to add is "if request is for phabricator-new, send to phab2001"
[17:09:39] as the first step
[17:09:53] this is the warm standby model, not real phab clustering
[17:10:03] except that we use clustering to copy the repos over
[17:10:11] 10Continuous-Integration-Config, 10puppet-compiler: Migrate Jenkins job "operations-puppet-catalog-compiler" to Jenkins Job Builder - https://phabricator.wikimedia.org/T97513#2842007 (10Paladox) How is the test set out? We could try and create a patch and see how we can get for example check and add...
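To make the /32 vs /16 exchange above concrete: a sketch, not the exact commands paladox ran, of setting `cluster.addresses` on a labs test instance with Phabricator's `bin/config` utility. The checkout path and the JSON-list form of the value are assumptions:

```
cd /srv/phab/phabricator   # hypothetical install path

# A /32 block contains exactly one address (10.68.16.15 and nothing else),
# so a second instance would sit outside the trusted cluster range:
./bin/config set cluster.addresses '["10.68.16.15/32"]'

# A /16 block covers the whole 10.68.0.0/16 labs network, so both test
# instances (e.g. phab-01 and phab-05) fall inside it:
./bin/config set cluster.addresses '["10.68.16.15/16"]'
```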
[17:10:13] yep
[17:10:25] at some point in the future it can probably become a real cluster
[17:10:31] yep
[17:10:34] where there are just 2 backends for phabricator.wm.org
[17:10:39] yep
[17:10:42] like with other services that have pools
[17:10:46] yep
[17:11:35] if it wasn't for the repo sync, we could already do that first part without having to worry about cluster support in phab at all
[17:11:42] oh
[17:11:43] i think
[17:11:59] well, that is what i originally thought would happen
[17:12:20] Oh
[17:12:21] before i knew about syncing the repos
[17:12:31] and when we talked about just rsyncing
[17:13:10] +
[17:13:12] Woops
[17:13:18] yep
[17:18:03] jdlrobson: you froze
[18:12:23] 03Scap3 (Scap3-MediaWiki-MVP), 03releng-201617-q2: Flatten MediaWiki config, all MediaWiki versions, and extensions into a unified git repo - https://phabricator.wikimedia.org/T147478#2842179 (10thcipriani) 05Open>03Resolved
[18:12:25] 03Scap3 (Scap3-MediaWiki-MVP), 10releng-201516-q3, 10scap, 07Tracking, 07WorkType-NewFunctionality: [EPIC] Migrate the MW weekly train deploy to scap3 - https://phabricator.wikimedia.org/T114313#2842181 (10thcipriani)
[18:22:53] PROBLEM - Puppet run on deployment-pdfrendertest02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[18:23:19] PROBLEM - Puppet run on deployment-eventlogging03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[18:24:11] PROBLEM - Puppet run on deployment-mediawiki04 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[18:27:06] PROBLEM - Puppet run on deployment-aqs01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[18:27:30] PROBLEM - Puppet run on deployment-sca02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[18:28:28] PROBLEM - Puppet run on deployment-tin is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[18:29:08] PROBLEM - Puppet run on deployment-restbase02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[18:29:16] PROBLEM - Puppet run on deployment-sca01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[18:34:19] PROBLEM - Puppet run on deployment-tmh01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[18:39:00] PROBLEM - Puppet run on deployment-eventlogging04 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[18:40:45] PROBLEM - Puppet run on deployment-restbase01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[18:43:14] ^ fixing
[18:43:25] PROBLEM - Puppet run on deployment-mediawiki06 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[18:44:52] PROBLEM - Puppet run on deployment-mathoid is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[18:47:37] PROBLEM - Puppet run on deployment-parsoid09 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[18:50:07] PROBLEM - Puppet run on deployment-jobrunner02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[18:50:27] PROBLEM - Puppet run on deployment-mediawiki05 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[18:51:18] PROBLEM - Puppet run on deployment-zotero01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[18:52:25] PROBLEM - Puppet run on deployment-changeprop is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[18:52:45] PROBLEM - Puppet run on deployment-sca03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[18:53:31] PROBLEM - Puppet run on deployment-pdfrender02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[18:55:58] PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[18:56:55] it seems that latest phabricator lost the ability to complete milestone tags
[18:57:00] is this a known issue?
[19:01:16] this makes it impossible to add milestones while editing / creating a task; instead, you have to add the general tag & then drag & drop it to the right column
[19:06:14] 10Gerrit, 07Upstream: Show reviewer 'capability' in Reviewers table - https://phabricator.wikimedia.org/T51149#2842366 (10Paladox) This is partially fixed in https://gerrit-review.googlesource.com/#/c/87970/
[19:13:55] twentyafterfour: do you know which service is the failed one on phab2001?
[19:14:07] i say that because Icinga is now talking about one being in "failed"
[19:14:12] as opposed to "disabled" or so
[19:14:30] it's a "check systemd state" generic check
[19:14:37] so it just says "a service"
[19:14:46] "units" sorry
[19:46:35] mutante: paladox: cluster.addresses breaks completely with IPv6
[19:46:41] and there is no workaround yet
[19:46:48] Oh
[19:46:54] twentyafterfour working for me
[19:47:04] https://phab-01.wmflabs.org
[19:47:06] mutante: I do not know which service is failed
[19:47:15] I have set cluster on ^^
[19:47:32] twentyafterfour as long as the server doesn't have an ipv6 address then it should work
[19:47:47] So if we do a temp removal of ipv6 from iridium and phab2001 it should work
[19:48:09] twentyafterfour or we could try https://secure.phabricator.com/D16973
[19:48:09] ?
[19:48:29] paladox: I don't think removing ipv6 is an option
[19:48:37] Oh
[19:48:48] twentyafterfour would trying that patch fix the problem?
[19:49:18] mutante: oh I know what's up with icinga: systemd state is degraded because phd is not running
[19:50:44] twentyafterfour i wonder if we can somehow force the error on phab-01?
[19:50:49] so we can try the patch above
[19:55:26] RECOVERY - Puppet run on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:56:12] RECOVERY - Puppet run on deployment-zotero01 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:57:22] RECOVERY - Puppet run on deployment-changeprop is OK: OK: Less than 1.00% above the threshold [0.0]
[19:57:35] RECOVERY - Puppet run on deployment-parsoid09 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:57:46] RECOVERY - Puppet run on deployment-sca03 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:58:33] RECOVERY - Puppet run on deployment-pdfrender02 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:59:51] twentyafterfour: do you think we should disable IPv6 on phab machines entirely?
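To answer mutante's "which unit is failed" question directly on the host: the generic Icinga check summarizes standard systemd state, so stock systemd commands show the same thing. A sketch; only the `phd` unit name is taken from the conversation above:

```
systemctl is-system-running           # prints "degraded" when any unit has failed
systemctl list-units --state=failed   # names the failed unit(s), e.g. phd
systemctl status phd                  # details for the Phabricator daemon unit
journalctl -u phd --since today       # recent log output for that unit
```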
[20:00:09] RECOVERY - Puppet run on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:00:14] twentyafterfour: i thought phd, yea, would not have surprised me, just that it calls it "failed" instead of just "stopped" or something
[20:00:42] we can make it so that this monitoring check only gets created in the first place if current server is "active server"
[20:00:49] and that info from hiera
[20:00:58] even better than ACKing
[20:00:59] RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[20:02:06] RECOVERY - Puppet run on deployment-aqs01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:02:10] twentyafterfour: is it an option to just rsync repos instead of using the cluster config to achieve the first step
[20:02:54] RECOVERY - Puppet run on deployment-pdfrendertest02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:03:18] RECOVERY - Puppet run on deployment-eventlogging03 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:04:12] RECOVERY - Puppet run on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:07:31] RECOVERY - Puppet run on deployment-sca02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:08:27] RECOVERY - Puppet run on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0]
[20:09:09] RECOVERY - Puppet run on deployment-restbase02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:09:17] RECOVERY - Puppet run on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:14:00] RECOVERY - Puppet run on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:14:20] RECOVERY - Puppet run on deployment-tmh01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:17:59] mutante: we could rsync them
[20:18:18] rsync is easy :)
[20:18:23] RECOVERY - Puppet run on deployment-mediawiki06 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:18:32] but that doesn't really achieve the desired result
[20:19:03] I can patch the issue in phabricator (already almost have that done) but you're just two steps ahead of me
[20:19:13] * twentyafterfour wasn't ready to roll all of this stuff out so soon
[20:19:41] would it achieve the result "we could switch over to make phab2001 prod if something happened to iridium right now" ? (separate from cluster setup)
[20:20:09] I mean if you've almost got a fix then no reason to rsync :)
[20:20:12] i wasn't expecting the cluster part either, just the "make it work on jessie" part
[20:20:40] and the "we can switch to phab2001 as prod host, so we can reinstall and upgrade and rename iridium"
[20:20:44] mutante: yes indeed rsync right now would be good for disaster recovery
[20:20:48] RECOVERY - Puppet run on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:20:48] and then switch back or not
[20:21:14] yeah I'm fine with that
[20:21:16] how about we do that first, without them caring about each other
[20:21:23] and then later back to cluster config
[20:21:34] I don't want to disable ipv6, I want to fix that in phab core instead
[20:21:48] also, next quarter we want to fail over to codfw
[20:21:51] twentyafterfour i think upstream has a fix for ipv6
[20:21:56] so it will be nice if we can just be on phab2001 then
[20:22:05] also separate from cluster config
[20:22:06] https://secure.phabricator.com/D16973
[20:22:06] paladox: I've been talking to upstream about it
[20:22:11] Oh
[20:22:12] :)
[20:23:01] once they are actually phab1001 and phab2001 and both on the same distro, it seems a nicer cluster anyways
[20:24:52] RECOVERY - Puppet run on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[20:24:53] :)
[20:25:06] restored your change
[20:25:07] mutante: for sure
[20:25:51] thanks
[20:32:09] paladox: the upstream fix is incomplete but it's 1/2 of it
[20:34:27] oh
[20:34:28] ok
[21:08:44] ostriches twentyafterfour i'm trying phabricator on php 7.1 (i had to remove the upstream patch that disabled php 7 support, and also had to add a function_exists check for a function because it uses apc, which is not available on php 7)
[21:08:51] on my local machine
[21:09:18] They say running phabricator on windows is not supported and may not work; i got it working
[21:10:11] using bash
[21:10:53] Well, it's more "we don't pay attention to windows and support could break at any time without warning, no support will be given"
[21:10:59] Same with HHVM.
[21:11:02] No reason it *couldn't* work
[21:11:14] Just that upstream doesn't have the cycles to care about making it work
[21:11:48] ostriches actually it won't work because the daemons require pcntl
[21:11:56] which is only on linux
[21:12:17] anyway i'm using their officially supported distro ubuntu (it's built into windows)
[21:12:23] yipee.
[21:12:24] * ostriches shrugs
[21:12:34] I don't use windows except for gaming.
[21:12:40] Oh
[21:12:47] windows 10
[21:12:48] ?
[21:13:10] ostriches ^^
[21:13:57] I think so
[21:14:08] Ok, yeh bash is on there
[21:14:14] as long as you have the new update.
[21:14:39] it's also the official image of ubuntu so microsoft did not touch or change anything.
[21:15:23] I don't need Ubuntu on my Windows, like I said I only use that PC for gaming :p
[21:15:38] Yep
[21:26:31] * mutante chimes in on gaming. i want https://en.wikipedia.org/wiki/MAME with https://en.wikipedia.org/wiki/Kaillera but not only was it Windows, it's also that MAME doesn't support Kaillera anymore
[21:26:48] lol
[21:27:11] this is "the original arcade machines like pacman, the real ROMS, but over the internet with your friends"
[21:27:18] oh
[21:27:26] :p
[21:43:17] twentyafterfour is this the start of the pull request support coming to phabricator https://secure.phabricator.com/D16981 ?
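Returning to the rsync fallback mutante and twentyafterfour settled on earlier in the hour, sketched out below. Hostnames follow the conversation (iridium as the active host, phab2001 as the warm standby), but the exact flags, path, and scheduling are assumptions rather than the real puppetized setup:

```
# Run on phab2001: pull a fresh copy of the repositories from iridium.
# --delete keeps removed repos from lingering on the standby.
rsync -az --delete iridium.eqiad.wmnet:/srv/repos/ /srv/repos/

# For disaster recovery this would be re-run periodically, e.g. from cron:
# */30 * * * * root rsync -az --delete iridium.eqiad.wmnet:/srv/repos/ /srv/repos/
```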
[21:47:08] never mind, that's a new release tool
[22:16:34] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10Reading-Web-Trending-Service, 06Services (watching): Move primary trending service development to github - https://phabricator.wikimedia.org/T151469#2842936 (10Jdlrobson) Can we just mirror this or do we need to completely migrate?
[22:47:23] (03PS2) 10Hashar: dib: remove eth1 configuration [integration/config] - 10https://gerrit.wikimedia.org/r/324744 (https://phabricator.wikimedia.org/T113342)
[22:47:25] (03PS1) 10Hashar: dib: set grub timeout to 0 [integration/config] - 10https://gerrit.wikimedia.org/r/325051 (https://phabricator.wikimedia.org/T113342)
[22:47:27] (03PS1) 10Hashar: dib: do not start varnish service on boot [integration/config] - 10https://gerrit.wikimedia.org/r/325052 (https://phabricator.wikimedia.org/T113342)
[22:47:55] (03CR) 10Hashar: [C: 04-1] "Untested" [integration/config] - 10https://gerrit.wikimedia.org/r/325052 (https://phabricator.wikimedia.org/T113342) (owner: 10Hashar)
[22:49:01] (03CR) 10Hashar: [C: 032] "Done just after the bootloader install and it works like a charm. Tested locally by booting the image in qemu." [integration/config] - 10https://gerrit.wikimedia.org/r/325051 (https://phabricator.wikimedia.org/T113342) (owner: 10Hashar)
[23:16:12] 10Continuous-Integration-Infrastructure, 05Continuous-Integration-Scaling, 13Patch-For-Review: Speed up the time to get a Nodepool instances to achieve READY state - https://phabricator.wikimedia.org/T113342#2843100 (10hashar) I have created a new Jessie image with the patch that drops eth1 and the one setti...
[23:18:42] (03CR) 10Hashar: [C: 032] "Tested and solves T113342" [integration/config] - 10https://gerrit.wikimedia.org/r/324744 (https://phabricator.wikimedia.org/T113342) (owner: 10Hashar)
[23:19:39] (03Merged) 10jenkins-bot: dib: remove eth1 configuration [integration/config] - 10https://gerrit.wikimedia.org/r/324744 (https://phabricator.wikimedia.org/T113342)
[23:19:42] (03Merged) 10jenkins-bot: dib: set grub timeout to 0 [integration/config] - 10https://gerrit.wikimedia.org/r/325051 (https://phabricator.wikimedia.org/T113342)
[23:28:03] hashar we could publish the puppet compiler code into the integration/config repo
[23:28:14] not change anything but at least we can use jenkins job builder
[23:28:37] to update the test in the future :)
[23:29:35] paladox: yeah
[23:29:38] there is a task for that
[23:29:44] haven't come to it though
[23:29:50] Yep, i'm subscribed :)
[23:29:55] but it should be straightforward. The job doesn't do much
[23:30:03] Yep
[23:30:44] I could try and see if i could do it, but i don't know what the code is.
[23:30:53] Needs a copy paste from the jobs config.
[23:31:04] do you have the Task number ?
[23:31:08] I can copy paste the xml
[23:32:07] let me quickly find it
[23:32:38] hashar https://phabricator.wikimedia.org/T97513
[23:34:07] paladox: https://phabricator.wikimedia.org/P4561 :D
[23:34:14] Thank you
[23:34:50] 10Continuous-Integration-Config, 10puppet-compiler: Migrate Jenkins job "operations-puppet-catalog-compiler" to Jenkins Job Builder - https://phabricator.wikimedia.org/T97513#2843117 (10hashar) Job pasted on P4561 Note @elukey from operations has done a few changes to the operations-puppet-compiler Jenkins job.
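Hashar summarizes the job just below as two parameters plus one simple shell command. A hypothetical JJB YAML skeleton of that shape; every name, label, and command here is a placeholder, nothing is copied from P4561:

```
- job:
    name: operations-puppet-catalog-compiler
    node: puppet-compiler        # assumed label for the permanent slave
    parameters:
      - string:
          name: GERRIT_CHANGE_NUMBER
          description: 'Gerrit change to compile (placeholder parameter)'
      - string:
          name: NODES
          description: 'Hosts to compile catalogs for (placeholder parameter)'
    builders:
      # the "simple command" hashar mentions; the path is illustrative
      - shell: '/usr/local/bin/puppet-compiler "$GERRIT_CHANGE_NUMBER" "$NODES"'
```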
[23:35:26] paladox: and https://phabricator.wikimedia.org/T97081#2842033 first few lines give an overview
[23:35:28] or in short
[23:35:31] add two parameters
[23:35:36] Ok
[23:35:39] thanks
[23:35:39] execute a simple command
[23:35:41] done :]
[23:35:45] Yep :)
[23:36:24] and valhallasw has a lot of nice experience :]
[23:36:35] Oh :)
[23:36:41] super friendly
[23:36:48] * paladox has to convert xml to yaml
[23:36:54] and is very effective / quick etc
[23:38:21] i wonder how to do https://phabricator.wikimedia.org/P4561, it looks different to how we do it in the repo
[23:39:30] hashar lol https://github.com/ktdreyer/jenkins-job-wrecker
[23:39:38] there's a tool that does this
[23:40:29] there's an app for everything or a tool for pc's
[23:40:34] yeah :D
[23:40:34] you can give it a try maybe
[23:40:38] never played with it myself
[23:40:52] Yep
[23:40:58] we can do it for the rest of the tests
[23:41:01] but really
[23:41:02] if it succeeds
[23:41:08] you need like 3 lines for each of the parameters
[23:41:12] Yep
[23:41:18] and a one-liner shell: whatevercommand_here
[23:41:37] yep
[23:41:39] probably an easy task for jj-wrecker
[23:41:44] yep
[23:41:47] i'm trying it now
[23:41:50] that is all for me
[23:41:51] i have to install pip
[23:41:54] !!
[23:42:23] paladox: you can install the packages in your homedir with: pip install --user
[23:42:37] Oh
[23:42:42] i can always erase
[23:42:47] would put the bin in something like $HOME/.local/bin/
[23:42:50] the system if it messes something
[23:42:53] which you will have to add to your path
[23:43:04] as i'm running windows bash ubuntu so it doesn't affect my main os
[23:43:12] ex: pip install --user flake8
[23:43:17] ==> ~/.local/bin/flake8
[23:43:20] yep
[23:43:37] thanks
[23:43:56] have a good night!
[23:44:00] and you too
[23:44:06] it's almost morning
[23:44:16] 23:44pm
[23:47:15] Yipee it worked
[23:47:21] now creating the test
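A sketch of the jenkins-job-wrecker run paladox is attempting, following hashar's pip advice above. The `-f` (exported config.xml) and `-n` (job name) flags match the tool's README as I recall it; treat the flags and the output location as assumptions if the version differs:

```
pip install --user jenkins-job-wrecker
export PATH="$HOME/.local/bin:$PATH"   # so the jjwrecker entry point is found

# Convert an exported Jenkins job config.xml into a JJB YAML definition;
# the filename here is illustrative, not a paste from the log.
jjwrecker -f operations-puppet-catalog-compiler.xml \
          -n operations-puppet-catalog-compiler
# The resulting YAML (see the tool's README for where it is written) can then
# be cleaned up and dropped into integration/config's jjb/ directory.
```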