[01:07:27] PROBLEM - Puppet failure on tools-exec-1201 is CRITICAL 30.00% of data above the critical threshold [0.0]
[01:32:22] RECOVERY - Puppet failure on tools-exec-1201 is OK Less than 1.00% above the threshold [0.0]
[05:49:35] 10Quarry: Add list of query executions to the query page side-bar - https://phabricator.wikimedia.org/T100982#1328147 (10Abarcomb) This request arose out of a discussion at a workshop. In my drawing, I saw modifications of the same query as branches, while completely new queries formed new top-level nodes. Of c...
[05:57:04] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1404 is CRITICAL 22.22% of data above the critical threshold [0.0]
[06:27:01] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1404 is OK Less than 1.00% above the threshold [0.0]
[06:43:08] PROBLEM - Puppet failure on tools-submit is CRITICAL 22.22% of data above the critical threshold [0.0]
[07:13:10] RECOVERY - Puppet failure on tools-submit is OK Less than 1.00% above the threshold [0.0]
[08:44:10] 10Tool-Labs-tools-Other, 10Analytics: Work on Metrics tools wm-metrics and MediaCollectionDB, refactoring and code quality. - https://phabricator.wikimedia.org/T100710#1328387 (10Qgil)
[08:44:21] 10Tool-Labs-tools-Other, 10Analytics: Work on Metrics tools wm-metrics and MediaCollectionDB, refactoring and code quality. - https://phabricator.wikimedia.org/T100710#1328388 (10Qgil) a:3JeanFred
[08:56:29] 6Labs, 10Labs-Infrastructure, 6operations, 3Labs-Sprint-100: Make a block-level copy of the codfw mirror of labstore1001 to eqiad - https://phabricator.wikimedia.org/T101010#1328391 (10mark) (Sparse) block level copying of the thin volumes started between the systems: ``` pv -eprab /dev/mapper/store-now_...
[09:46:04] 6Labs, 5Patch-For-Review: Get Labs openstack service dbs on a proper db server - https://phabricator.wikimedia.org/T92693#1328461 (10jcrespo) @Springle Checking with you, because you were restructuring the backups/dbstore1: https://gerrit.wikimedia.org/r/#/c/214994/ https://gerrit.wikimedia.org/r/#/c/214993/
[12:27:13] 6Labs: Clean up, delete 'openmeeting' project - https://phabricator.wikimedia.org/T101039#1328646 (10Matanya)
[12:27:46] 6Labs: Clean up, delete 'openmeeting' project - https://phabricator.wikimedia.org/T101039#1327280 (10Matanya) Please don't delete video project, turns out the jessie does serve my needs. Thanks!
[12:36:10] YuviPanda: Krenair: Can I get a +1 for https://gerrit.wikimedia.org/r/#/c/215317/ ?
[12:37:03] we've presumably tested a few instances without use_dnsmasq, right?
[12:37:28] Krenair: yes — my bastion has been running without it for a month...
[12:37:39] I just now turned it off for all the other bastions.
[12:38:37] done
[12:38:49] yup
[12:39:39] I've always thought it was kind of strange that I could look up wmnet hosts from labs
[12:39:46] But I guess since it's all in the public operations/dns repo...
[12:40:17] Krenair: yeah, and I think you still can after this change. It’s weird but harmless.
[12:40:23] yep
[12:40:27] Actually, I’ll make a task to disable it; it’s easy.
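For readers unfamiliar with the procedure mark references in T101010 above: a sparse block-level copy of a thin volume, with pv for progress reporting, generally looks something like the sketch below. The device names, target host, and dd options are placeholders (the actual command is truncated in the log), so this is an illustration rather than the invocation that was run.

```
# Hypothetical sketch of a sparse block-level copy of a thin LV to a remote
# host; device names, host, and block size are assumptions.
pv -eprab /dev/mapper/store-now_snap \
  | ssh root@labstore1001.eqiad.wmnet \
      'dd of=/dev/mapper/store-now bs=16M conv=sparse'
```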
[12:41:26] 6Labs, 10Labs-Infrastructure: Don't forward .wmnet requests from the labs resolver - https://phabricator.wikimedia.org/T101090#1328687 (10Andrew) 3NEW a:3Andrew
[12:41:48] 6Labs, 10Labs-Infrastructure: Don't forward .wmnet requests from the labs resolver - https://phabricator.wikimedia.org/T101090#1328696 (10Andrew)
[12:42:51] 6Labs, 10Labs-Infrastructure: Don't forward .wmnet requests from the labs resolver - https://phabricator.wikimedia.org/T101090#1328701 (10yuvipanda) Note that we want to resolve at least some things - labmon1001.eqiad.wmnet, labsdb100*.eqiad.wmnet, and I'm sure there'll be more things in the future.
[12:45:36] 6Labs, 10Labs-Infrastructure: Don't forward .wmnet requests from the labs resolver - https://phabricator.wikimedia.org/T101090#1328705 (10Andrew) really? It's useful to resolve those IPs even though we can't route to them?
[12:47:07] 6Labs, 10Labs-Infrastructure: Don't forward .wmnet requests from the labs resolver - https://phabricator.wikimedia.org/T101090#1328714 (10yuvipanda) You can route to them! We have several holes in that firewall.
[12:47:49] 6Labs, 10Labs-Infrastructure: Don't forward .wmnet requests from the labs resolver - https://phabricator.wikimedia.org/T101090#1328716 (10Andrew) 5Open>3declined ok, should just leave this as is then. Easy!
[12:48:13] 10Tool-Labs, 3Labs-Sprint-100, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Clean up local crontabs - https://phabricator.wikimedia.org/T96472#1328719 (10yuvipanda) This is done, and I moved the following tools' crontabs to tools-submit with appropriate jsubs: # reportsbot # revibot # revibot-i # revibot-ii # r...
[12:48:31] andrewbogott: :D
[12:48:34] heh, ok
[12:51:18] 10Tool-Labs, 3Labs-Sprint-100, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Clean up local crontabs - https://phabricator.wikimedia.org/T96472#1328729 (10yuvipanda) 5Open>3Resolved a:3yuvipanda
[12:52:00] 6Labs, 10Tool-Labs, 3Labs-Sprint-100: Move tools to designate - https://phabricator.wikimedia.org/T96641#1328739 (10yuvipanda)
[12:52:02] 10Tool-Labs: Move toollabs to designate - https://phabricator.wikimedia.org/T100023#1328740 (10yuvipanda)
[12:52:36] 6Labs, 10Tool-Labs: Move tools to designate - https://phabricator.wikimedia.org/T96641#1223096 (10yuvipanda)
[12:53:08] 10Tool-Labs, 3Labs-Sprint-100: Move toollabs to designate - https://phabricator.wikimedia.org/T100023#1303692 (10yuvipanda)
[12:53:15] 10Tool-Labs, 3Labs-Sprint-100: Move toollabs to designate - https://phabricator.wikimedia.org/T100023#1303692 (10yuvipanda) Bastions done.
[12:54:14] 10Tool-Labs: Get rid of tools-trusty bastion - https://phabricator.wikimedia.org/T101094#1328758 (10yuvipanda) 3NEW
[12:56:27] !log tools switched labs webproxies to designate, forcing puppet run and restarting nscd
[12:56:30] Logged the message, Master
[12:56:45] andrewbogott: do you think we should / could stress test the new DNS resolvers?
[12:57:10] maybe, although I don’t have a good idea as to how.
[13:00:20] andrewbogott: google lists quite a few of 'em
[13:00:39] that would be a script that we run from w/in labs?
[13:00:53] YuviPanda: did legoktm poke you about the error message received at https://tools.wmflabs.org/checker ?
[13:02:01] andrewbogott: yea. like https://github.com/jedisct1/dnsblast or http://pentestmonkey.net/tools/misc/dns-grind
[13:02:05] and that there is a redirect error at https://tools.wmflabs.org/checker/
[13:02:08] sDrewth: vaguely. can you file a task on phab?
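For anyone wanting to reproduce the stress test discussed above without dnsblast or dns-grind, a crude approximation can be run from any Labs instance with dig and xargs; the resolver name, query names, and concurrency below are assumptions chosen for illustration.

```
# Fire a few thousand parallel A-record queries at the new recursor and watch
# for timeouts/SERVFAILs; target and hostnames are placeholders.
RESOLVER=labs-recursor0.wikimedia.org
seq 1 5000 | xargs -P 50 -I NUM \
    dig +short +time=1 +tries=1 @"$RESOLVER" "host-NUM.tools.eqiad.wmflabs" A
```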
[13:02:14] k
[13:02:26] YuviPanda: ok, sure, would be good to get some stats comparing the old and new. Want to try?
[13:02:34] andrewbogott: yeah let me do
[13:02:59] 10Tool-Labs, 3Labs-Sprint-100: Move toollabs to designate - https://phabricator.wikimedia.org/T100023#1328781 (10yuvipanda) And webproxies done! Doing a stress test now.
[13:03:48] andrewbogott: labs-recursor01.wikimedia.org right?
[13:03:56] yep
[13:04:15] andrewbogott: hmm
[13:04:20] yuvipanda@tools-bastion-02:~/dnsblast$ ping labs-recursor01.wikimedia.org
[13:04:23] ping: unknown host labs-recursor01.wikimedia.org
[13:04:23] wat
[13:05:34] clearly I don’t remember the name
[13:05:36] * andrewbogott looks
[13:05:47] labs-recursor0
[13:05:58] which host is it on, btw?
[13:06:06] holmium
[13:06:15] but the recursor has its own ip
[13:06:27] if you hit holmium directly you’ll be hitting a different dns server
[13:10:14] 10Tool-Labs: Tool checker throwing errors following labs changes - https://phabricator.wikimedia.org/T101097#1328816 (10Billinghurst)
[13:11:53] YuviPanda: afaict, the thing that prevented the shadow from starting its master is the same thing that broke the master to begin with. I'm about to forcibly try a failover by killing master, unless you desperately need it working atm.
[13:12:00] 10Tool-Labs: Tool checker throwing errors following labs changes - https://phabricator.wikimedia.org/T101097#1328807 (10Billinghurst) @mzmcbride adding to ticket
[13:15:06] 10Tool-Labs: Tool checker throwing errors following labs changes - https://phabricator.wikimedia.org/T101097#1328831 (10yuvipanda) When did this stop working?
[13:15:32] Coren: yeah, go for it
[13:16:17] 10Tool-Labs: Tool checker throwing errors following labs changes - https://phabricator.wikimedia.org/T101097#1328834 (10Billinghurst) Looks to be by 15 May [1] [1] https://en.wikisource.org/wiki/Wikisource:Scriptorium/Help#Page_Checker_gadget_-_page_is_not_directing_properly_._._._again
[13:17:37] Coren: can you !log your restarts / other master related actions here? thanks.
[13:17:54] !log tools killing the sge_qmaster to test failover
[13:17:57] Logged the message, Master
[13:23:22] !log tools stracing the shadowd to see what's up; master is down as expected.
[13:23:25] Logged the message, Master
[13:23:49] Tests made annoying because the damn thing does hardcoded sleep(60). *grumble*
[13:25:10] catchpoint caught this outage as well
[13:25:10] * YuviPanda feels quite happy about that
[13:25:10] should get you guys accounts when chase is back online
[13:25:11] am I missing an email from catchpoint?
[13:25:28] Krenair: this doesn't go to ops@ yet
[13:25:30] or is this not something that goes to ops?
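A rough sketch of the failover exercise Coren !logs above, assuming the stock Debian gridengine layout on tools-master/tools-shadow (the paths are an assumption, not confirmed from the hosts themselves):

```
# On the current master: check which host gridengine considers active, then
# kill the qmaster and let sge_shadowd on tools-shadow promote itself.
cat /var/lib/gridengine/default/common/act_qmaster
sudo pkill -f sge_qmaster
# Poll until the active-master file flips to the shadow's name (the log
# notes this took two 60s timeouts rather than the expected 30s).
watch -n 10 cat /var/lib/gridengine/default/common/act_qmaster
```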
[13:25:30] ok
[13:25:34] we haven't quite figured out where to make it go to
[13:25:37] this is toollabs specific
[13:25:39] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs - https://phabricator.wikimedia.org/T100885#1328863 (10Tobi_WMDE_SW)
[13:26:15] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs - https://phabricator.wikimedia.org/T100885#1328865 (10yuvipanda) +1 would be useful :)
[13:27:50] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs - https://phabricator.wikimedia.org/T100885#1328871 (10Hydriz) The dumps have always existed under `/data/scratch/wikidata` and `/data/scratch/wikibase` for people to use....
[13:28:15] !log tools sge_shadowd started a new master as expected, after /two/ timeouts of 60s (unexpected)
[13:28:18] Logged the message, Master
[13:28:42] Hm.
[13:29:02] So it works, but takes no less than 2 minutes despite the current settings which should be 30s.
[13:29:20] 10Tool-Labs: Tool checker throwing errors following labs changes - https://phabricator.wikimedia.org/T101097#1328877 (10MZMcBride) Thanks for filing this, @Billinghurst. Over the weekend, I tried all sorts of commands to try to revive the tool, but none of them were successful. I also looked at various logs in...
[13:29:21] Ah, and there are... DNS issues?
[13:30:03] server host resolves rdata host "tools-bastion-01.tools.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)"
[13:31:04] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs - https://phabricator.wikimedia.org/T100885#1328882 (10Addshore)
[13:31:11] That's odd. Resolves properly from bastions, but not tools-shadow
[13:31:51] ohwait. Why is the resolv.conf different on those hosts?
[13:32:04] andrewbogott: ^^ part of your current wip?
[13:33:20] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs - https://phabricator.wikimedia.org/T100885#1328891 (10Addshore) @Hydriz Ahh, it's awesome that they exist there! Yes it would be great if they were moved!
[13:33:35] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs in /public/dumps - https://phabricator.wikimedia.org/T100885#1328892 (10Addshore)
[13:33:46] Coren: the bug the last time was that /etc/hosts was too big and it couldn't quite read it
[13:34:24] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs in /public/dumps - https://phabricator.wikimedia.org/T100885#1322765 (10Addshore)
[13:34:31] oh, right. bastions were switched earlier today, I wonder if that's causing them to send in different hostname info?
[13:34:34] Yeah, right now the issue seems to be different; it looks like the dns server the shadow resolv.conf points to really can't resolve.
[13:35:34] YuviPanda: It does. I just noticed - the new DNS gives the full new fqdn which would be okay if the master and shadow could resolve it.
[13:35:45] Coren: we could switch the master / shadows too
[13:35:51] was going to do that later anyway
[13:35:56] should I switch now?
[13:36:12] Hm. Lemme check to make sure what the impact of the names changing would be.
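A quick way to reproduce the resolution discrepancy being debugged above is to compare what each host's configured resolver returns for the old and new name forms; the hostnames here are examples only.

```
# Which resolver is this instance using, and can it resolve both name forms?
cat /etc/resolv.conf
getent hosts tools-bastion-01.eqiad.wmflabs          # old-style name
getent hosts tools-bastion-01.tools.eqiad.wmflabs    # new project-scoped name
# Query the configured nameserver directly, bypassing nscd and /etc/hosts.
dig +short tools-bastion-01.tools.eqiad.wmflabs \
    @"$(awk '/^nameserver/ {print $2; exit}' /etc/resolv.conf)"
```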
[13:36:23] ok
[13:36:25] May need to update the gridengine config first.
[13:36:28] Coren: note that the old names will still resolve
[13:36:50] Yeah, but the clients still identify themselves with the new names. That should only affect admin hosts though.
[13:37:15] (exec hosts don't care what their names are so long as the master can reach them)
[13:38:06] But also, this is a change from today so really shouldn't affect why you didn't get failover last week.
[13:38:15] Coren: (probably you’ve caught up with this already) dnsmasq-using hosts can’t resolve name.project.eqiad.wmflabs; that’s strictly a feature of the new setup.
[13:38:43] andrewbogott: I know; this is clearly what's going on now.
[13:39:31] YuviPanda: As expected, the switchover works fine from tools-submit which didn't get the new dns.
[13:39:40] right
[13:39:53] YuviPanda: So if I add the new names to the config and we switchover the masters things should work.
[13:39:56] * Coren does that now.
[13:40:01] ok
[13:42:13] Oy vey. Circular dependency. Can't add a host unless it resolves, can't switch resolver until hosts are added.
[13:42:19] * Coren duct tapes over the issue.
[13:46:45] Yeah, that switchover is going to be "fun". Heh. I have enough duct tape applied, we need to switch master and shadow to the new scheme now.
[13:47:18] Then I'll be able to finish the config changes - I manually added tools-bastion-01.tools.eqiad.wmflabs as admin host.
[13:47:21] Coren: want me to do the dns switchover?
[13:47:39] YuviPanda: Please. I'll +2 your patch since you already know exactly what to do.
[13:48:17] Coren: nah, config change in wikitech. already done, you can force a puppet run to get new resolv.conf
[13:48:32] Thankfully none of this should affect running jobs.
[13:49:22] PROBLEM - Puppet failure on tools-redis is CRITICAL 100.00% of data above the critical threshold [0.0]
[13:49:38] Annoying, but the switch to the new names is a big win and worth the trouble.
[13:49:55] the old names still work tho
[13:50:23] The issue isn't that the old names don't work but that the clients identify themselves with the new one.
[13:50:40] ah, hmm
[13:50:49] Ah. And now the tools-bastion-01 can talk to the current master properly.
[13:51:03] So I am now able to add the new names.
[13:52:27] !log tools new-style names for gridengine admin hosts added
[13:52:30] Logged the message, Master
[13:53:09] !log tools moved tools-master / shadow to designate
[13:53:12] Logged the message, Master
[13:54:22] !log tools adding new-style names for submit hosts
[13:54:24] Logged the message, Master
[13:55:00] Coren: can you list the submit hosts? I should switch them over to designate too
[13:56:29] YuviPanda: qconf -ss gives you the list, but wait until I'm done adding the new ones first.
[13:56:36] Coren: ah ok
[13:57:18] that's a lot of hosts
[13:57:24] it might just be easier to turn it on toolswide
[13:57:32] Probably.
[13:58:05] I'll wait till you've added all the new ones, I guess
[13:58:09] That list actually needs cleanup too; once the switch is done. Right now, I'm keeping both series of names.
[13:58:34] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1328951 (10Magnus) 3NEW
[14:00:00] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1328972 (10yuvipanda) I just rebooted the host, let's see if that fixes it.
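For reference, the qconf side of what is being done above looks roughly like this; the hostnames are examples rather than the full list that was actually added.

```
qconf -ss                                        # list current submit hosts
qconf -sh                                        # list current admin hosts
qconf -ah tools-bastion-01.tools.eqiad.wmflabs   # add a new-style name as admin host
qconf -as tools-submit.tools.eqiad.wmflabs       # add a new-style name as submit host
```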
[14:03:30] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1328982 (10Magnus) Doesn't look like it: magnus@tools-bastion-01:~$ ssh wdq-mm-02.eqiad.wmflabs ssh: connect to host wdq-mm-02.eqiad.wmflabs port 22: Connection refused
[14:03:36] YuviPanda: I need to test the impact on exec nodes before we proceed further.
[14:03:46] * Coren designates tools-exec-catscan as the victim.
[14:03:53] ok
[14:04:23] Coren: empty the variable use_dnsmasq in the configure page for the tool and force a puppet run and it should switch
[14:06:41] Gotcha.
[14:07:51] Thankfully, gridengine never relies on reverse dns
[14:19:52] andrewbogott: hmm, something looks off with wdq-mm-02
[14:20:07] andrewbogott: as in a reboot doesn't seem to have actually rebooted it.
[14:20:10] let me try again, actually
[14:20:44] Coren: how's it going?
[14:21:06] YuviPanda: Not sure yet. Still in the middle of prelim testing.
[14:21:38] ok
[14:22:30] what's killing wikibugs?
[14:27:07] 10Tool-Labs: Tool checker throwing errors following labs changes - https://phabricator.wikimedia.org/T101097#1329554 (10yuvipanda) I see an `@app.route('')` in the code that is causing: ```Traceback (most recent call last): File "/data/project/checker/www/python/src/app.py", line 139, in @app.rou...
[14:27:37] bblack: > UnicodeEncodeError: 'ascii' codec can't encode character '\xe5' in position 231: ordinal not in range(128)
[14:27:42] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1329557 (10Magnus) I just restarted the wdq-mm service on wdq-mm-01.eqiad.wmflabs; while it was stopped, I couldn't get the web site, which probably means wdq-mm-02.eqiad.wmflabs is not serving. That works for now, but I'd...
[14:28:05] heh
[14:28:19] so it's because someone put a non-ascii character in a commit title, basically?
[14:28:26] possibly. I'm not sure
[14:28:32] I think it's handled those well before
[14:30:03] andrewbogott: I might be late to the party, but the DNS change is breaking SGE
[14:30:07] "error: commlib error: access denied (client IP resolved to host name "tools-submit.tools.eqiad.wmflabs". This is not identical to clients host name "tools-submit.eqiad.wmflabs")"
[14:30:22] valhallasw: I think Coren and YuviPanda are working on that
[14:30:29] ok, great!
[14:30:59] Coren: did you add new name for tools-submit?
[14:31:36] YuviPanda: I have. I'm trying to figure out why -submit isn't behaving like the bastions.
[14:31:44] ok
[14:32:18] Oh, ew. It's gotta be all-or-nothing.
[14:32:42] * Coren curses.
[14:33:02] YuviPanda: btw, can you please revive virt1005 sometime soon? It doesn’t make much of a lifeboat at the moment.
[14:33:06] YuviPanda: I think we better rip the band-aid off and deal with the fallout.
[14:33:36] YuviPanda: moritz is making some security changes and… it would be nice to have a test box
[14:33:38] Coren: alright, let me turn it on everywhere.
[14:34:35] !log tools turned off dnsmasq for toollabs
[14:34:36] Coren: ^
[14:34:38] Logged the message, Master
[14:34:48] * Coren forces a puppet run on -submit
[14:40:08] Coren: that seems to have fixed it
[14:40:44] Yeah, that's expected. I'm trying to see how things fare for the exec nodes now.
[14:40:54] ok
[14:44:30] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1329624 (10yuvipanda) The node looks kind of dead :| I'll ping @andrew to investigate - if not we can always rebuild it!
[14:45:12] afaict, everything is working fine atm.
[14:45:32] Hm.
I should say "seems to be"
[14:45:49] So as to not look too foolish if things explode later. :-)
[14:46:02] heh
[14:46:11] Coren: should we make them all use their new hostnames?
[14:46:18] I didn't expect the hostnames themselves to change :|
[14:46:26] hostname -f gives the newer project scoped name
[14:46:49] hashar, Krinkle: I’m writing down an upgrade path for projects that use self-hosted puppet with clients. one step involves changing a line in /etc/puppet/puppet.conf by hand on each client. Is that acceptable? (Of course you can use sed and/or salt depending on your setup and preferences)
[14:52:45] 10Tool-Labs: Tool checker throwing errors following labs changes - https://phabricator.wikimedia.org/T101097#1329659 (10yuvipanda) ^ was on doing a webservice uwsgi-python start, which is what @legoktm told me this was running on.
[14:54:06] Oh, hell.
[14:54:20] 10Tool-Labs-tools-Other: Tool checker throwing errors - https://phabricator.wikimedia.org/T101097#1329664 (10yuvipanda)
[14:54:55] YuviPanda: The exec node names changing means a restart of the shepherds break. Means every host needs to be readded and queue lists changed.
[14:55:06] * Coren does that now.
[14:55:33] Coren: so can we force a puppet run, and then just add the newly generated file definitions for exec nodes that puppet puts out?
[14:56:19] Hm no, we don't want to actually _add_ those hosts to the queue as this would do. I need to add, manually edit to queues to rename, then delete.
[14:56:31] There is no support for renaming.
[14:56:37] Coren: oh, because then they'd kind of forget what jobs they're currently running?
[14:56:52] (As this is insanely baroque and unusual)
[14:57:34] YuviPanda: I don't know - I only tried (on purpose) on an empty exec node. I'm pretty sure it'd be all bad.
[14:57:46] yeah, I won't be surprised
[14:58:03] Coren: another option is to make the output of hostname -f go back to what it used to be
[14:58:06] But since the change only takes effect on restarting the shepherd, that means the node is restarted anyways.
[14:58:08] which is without the project prefix
[14:58:17] this we can do with an /etc/hosts entry
[14:58:50] I've already fixed everything but exec nodes, and we'd just be setting ourselves up for a bigger fail down the line I think.
[14:59:00] Better suffer a bit now than set us up a bomb.
[14:59:03] yeah, I agree
[14:59:32] And yes, that's a ayb reference. It's old enough that it's retro cool by now. :-)
[15:06:41] Coren: hah :)
[15:06:50] Coren: keep !logging and let me know if there's something I can do to help
[15:08:49] I'm still trying to figure out the least painful way forward.
[15:09:05] ok
[15:21:34] PROBLEM - Puppet failure on tools-exec-1410 is CRITICAL 100.00% of data above the critical threshold [0.0]
[15:25:14] Hm. It might be possible to trick the master.
[15:27:34] Coren: how so?
[15:27:35] 10Tool-Labs, 10pywikibot-core, 5Patch-For-Review: Support Debian package python-ipaddr - https://phabricator.wikimedia.org/T100603#1329806 (10jayvdb) >>! In T100603#1316531, @jayvdb wrote: > Travis whitelist request posted: https://github.com/travis-ci/travis-ci/issues/3973 no response yet.
[15:28:27] YuviPanda: I'm testing now, but it /looks/ like an unrestarted shepherd will work with the new name.
[15:28:40] So we can rename hosts without having to restart them
[15:28:48] Coren: by 'shepherd' do you mean the gridengine-exec process?
[15:28:52] * Coren nods.
[15:29:27] Coren: does restarting them actually cause gridengine to lose state info?
[15:30:07] You mean while there are jobs?
[15:31:47] yeah
[15:32:15] I expect it would because the sge_execd is actually the parent of the sge_shepherd processes. Alternately, the execd might just kill its children (which would also make sense).
[15:33:35] I don't think we have a better solution than disable-evacuate-"rename"-enable
[15:34:08] Coren: ugh
[15:34:17] Ima gonna get started on that then.
[15:34:31] Coren: we usually have straggler processes that take days to drain
[15:34:53] Hm.
[15:34:54] Coren: we could still do the /etc/hosts 'hack'.
[15:35:03] Coren: and then we have more breathing room to do the move arounds.
[15:35:21] the hack is also trivially puppetizable
[15:35:22] 6Labs, 10Beta-Cluster, 10Continuous-Integration-Infrastructure, 10Staging: Change local puppetmaster names for RE labs projects - https://phabricator.wikimedia.org/T101110#1329841 (10Andrew) 3NEW a:3thcipriani
[15:35:53] I'm not sure how that'd help. The actual amount of work is exactly the same, we're just introducing more confusion for longer.
[15:36:14] * Coren ponders.
[15:36:42] Alternately, I can add all the "new" hosts
[15:36:54] But leave them disabled.
[15:37:09] As we evacuate and restart, they'll be picked up automatically.
[15:37:10] restarting all jobs is somewhat disruptive and we shouldn't do that if possible, I think
[15:37:36] So we can still spread things over a longer period without introducing a hack.
[15:38:16] 6Labs, 10Beta-Cluster, 10Continuous-Integration-Infrastructure, 10Staging: Change local puppetmaster names for RE labs projects - https://phabricator.wikimedia.org/T101110#1329862 (10thcipriani) p:5Triage>3High
[15:38:28] So, we evacuate a fraction, and wait until they drain
[15:38:37] however long that takes.
[15:38:50] Any exec node that is not restarted will keep on working.
[15:39:09] took us more than a week last time
[15:39:20] Sure, but nothing would break in the meantime.
[15:42:58] Either way, need to add the "new" nodes.
[15:43:24] !log tools adding the "new" exec nodes (aka, current nodes with new names)
[15:43:28] Logged the message, Master
[15:46:39] andrewbogott: yeah we can do the manual step for sure.
[15:46:53] andrewbogott: there is even a salt master for beta and integration labs project. Not sure how well it work though
[15:47:00] hashar: great. I’ll send an email or something before I break anything.
[15:49:45] 6Labs, 10Beta-Cluster, 10Continuous-Integration-Infrastructure, 10Staging: Change local puppetmaster names for RE labs projects - https://phabricator.wikimedia.org/T101110#1329841 (10mmodell) this task is binary.
[15:53:56] Ah, 1401 only has a single, restartable job on it. Using it as guinea pig
[15:54:15] !log tools Switching names for tools-exec-1401
[15:54:18] Logged the message, Master
[16:04:21] I'm still getting SGE errors on submission
[16:04:24] https://tools.wmflabs.org/gerrit-reviewer-bot/
[16:04:28] (at the bottom)
[16:04:34] valhallasw: I am aware. Working on it.
[16:06:38] RECOVERY - Puppet failure on tools-exec-1410 is OK Less than 1.00% above the threshold [0.0]
[16:08:28] .. there seems to be something intermittently odd about dns.
[16:10:34] on tools-submit? that's weird
[16:10:53] !log tools restart nscd on tools-submit
[16:10:57] Logged the message, Master
[16:11:04] note that it's talking to -shadow, not to -master
[16:11:10] oh right
[16:11:18] yes that's current master
[16:11:28] hm, okay.
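One plausible shape of the "add the new names but leave them disabled" approach described above, sketched with standard gridengine commands; the queue name and hostname are assumptions and this may not match exactly what was run.

```
qconf -ae          # add an execution host: in the editor, set the hostname to
                   # tools-exec-1401.tools.eqiad.wmflabs (name assumed)
qmod -d 'task@tools-exec-1401.tools.eqiad.wmflabs'   # keep its queue instance disabled ("task" queue assumed)
# Re-enable once the old-name node has been drained and removed:
qmod -e 'task@tools-exec-1401.tools.eqiad.wmflabs'
```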
[16:11:55] valhallasw: that was part of https://phabricator.wikimedia.org/T90546
[16:12:23] I see
[16:18:16] valhallasw: how does https://etherpad.wikimedia.org/p/bigbrother-almost-no-more sound?
[16:18:28] valhallasw: I should probably add a note about not touching the service.manifests
[16:19:11] Gah.
[16:19:28] 6Labs, 10Beta-Cluster, 10Continuous-Integration-Infrastructure, 10Staging: Change local puppetmaster names for RE labs projects - https://phabricator.wikimedia.org/T101110#1330124 (10scfc) https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster#Puppetmaster_Setup explicitly says: > […] When you...
[16:20:11] I've no idea wtf is going on. Resolution seems to be random.
[16:20:36] !log tools switching back to tools-master
[16:20:39] Logged the message, Master
[16:26:46] Coren: probably related to the huge /etc/hosts file still on tools-shadow - same issues as last time, I guess
[16:26:47] * YuviPanda purges it
[16:26:52] Coren: tools-master should be fine
[16:27:10] !log tools cleaned out /etc/hosts file on tools-shadow
[16:27:13] Logged the message, Master
[16:30:41] YuviPanda: tools-master has the same issue, seemingly.
[16:31:04] oh wow
[16:31:13] so we're back to same spot as last week's outage then...
[16:31:35] yuppp
[16:32:07] And so, in the end, unrelated to /etc/hosts
[16:32:12] yes
[16:32:23] well, *maybe*
[16:32:31] * Coren attempts to debug
[16:33:07] Hm. The DNS inconsistency is really playing hell with gridengine.
[16:34:29] Coren: I still think we should put in a /etc/hosts hack that should help with hostname -f output
[16:34:31] It almost looks as though the resolver gets random names for the same IP
[16:34:41] and then we don't actually have to touch anything for a while?
[16:34:47] unless I'm misunderstanding the problem
[16:35:31] YuviPanda: It's likely to just make matters worse; now you'll have /etc/hosts contradicting dns and the resolver is likely to get no better for it.
[16:35:55] The issue predates the dns switch for the bastions, we're just seeing it more.
[16:35:59] Coren: it shouldn't contradict dns, no? like, hitting tools-master.eqiad.wmflabs will also return the correct IP
[16:36:28] yes, but reversing the ip won't match the forward host name and while the /clients/ don't care, the master checks.
[16:37:24] But right now, the master is getting inconsistent values. I'm trying to track down how exactly.
[16:37:29] YuviPanda: edited email a bit
[16:37:30] ok
[16:37:37] valhallasw: looking
[16:43:27] YuviPanda: We may have to cheat after all; gridengine *really* doesn't like hostname changing on the exec nodes where jobs are running. Either we have to restart everything pretty much at once or we have to cheat the names with a hosts hack, and restart everything with intermediate changes to /etc/hosts.
[16:43:56] 6Labs, 10Beta-Cluster: Make sure labs ENC knows about cert changes - https://phabricator.wikimedia.org/T101124#1330255 (10Andrew) 3NEW a:3Andrew
[16:44:25] Guh. That thing was *really* not meant to have its host names switched.
[16:44:35] YuviPanda: there was some ticket that alex worked on about nat and labs vm's
[16:44:45] does that relate to https://phabricator.wikimedia.org/T95714 do you know?
[16:45:01] Coren: /etc/hosts hack gets my vote
[16:45:37] YuviPanda: It's only a partial band-aid; we'll also have to do the whole thing by hand anyways. But it'll get things back to running for now at least.
[16:45:50] Coren: what do you mean by 'whole thing by hand'?
[16:46:27] evacuate-dequeue-change entry in hosts file-delete old node-requeue. For every exec node.
[16:46:36] 10Tool-Labs-tools-Other: Tool checker throwing errors - https://phabricator.wikimedia.org/T101097#1330278 (10Legoktm) 5Open>3Resolved a:3Legoktm So...I added the `@app.route('')` while debugging and forgot to remove it :/ Somehow the webserver was started as lighttpd instead of uwsgi-python, and something...
[16:46:54] * Coren ponders
[16:47:06] We don't have enough room to just create new nodes, I think.
[16:47:13] (That'd also work)
[16:47:35] Coren: if we just add the /etc/hosts entry, won't all the current nodes report the same hostname -f as they used to be, and just go on as usual?
[16:48:34] They'll report with the same fqdn, sure, so it'll wallpaper over the current issue. But it's brittle and will need to be changed.
[16:48:55] Coren: sure, but we won't have a continuing outage?
[16:49:13] I know, that's why I'm working on generating it now.
[16:49:28] wouldn't https://gerrit.wikimedia.org/r/#/c/215355/ be all that's needed?
[16:50:07] or probably not.
[16:50:08] * YuviPanda test
[16:50:08] s
[16:50:44] Heh. No; that would not help at all.
[16:51:17] The exec node needs to know its fake fqdn for its IP, and the masters need to agree on that name for that IP as well.
[16:52:01] Coren, YuviPanda, sorry this is such a mess :(
[16:52:17] andrewbogott: Had to be done eventually.
[16:52:29] hi. after the upcoming puppetmaster/dns changes, will labs puppet client certs be based on the fqdn, or still the ec2id?
[16:52:44] Coren: I thought gridengine doesn't use rdns?
[16:53:14] YuviPanda: So I thought; the clients don't but the master checks since auth is name-based (makes sense, actually)
[16:53:28] oh, sigh. right..
[16:53:55] YuviPanda: ~marc/hh contains what should be in the hosts file
[16:54:21] 6Labs, 10Beta-Cluster, 10Continuous-Integration-Infrastructure, 10Staging: Change local puppetmaster names for RE labs projects - https://phabricator.wikimedia.org/T101110#1330304 (10Andrew) Yes, it works. From puppet: # else assume short hostname and append domain. default => "${...
[16:54:33] Coren: ah, and this should be on the master / shadow hosts files?
[16:54:55] YuviPanda: Yes, so that it'll match that the exec nodes currently self-report
[16:55:12] Coren: alright.
[16:55:19] * YuviPanda stands by, doesn't fiddle with anything
[16:55:34] I guess we should puppetize that.
[16:55:38] * YuviPanda is unsure how exactly to
[16:55:52] The only alternative I see is to really just restart everything on the "new" names.
[16:56:12] But even then, that's going to cause reporting issues because the old nodes are unable to report in atm.
[16:56:24] Coren: we can restart all the restartable jobs and then let the running tasks just continue, I guess?
[16:56:53] Sure, but if we do that we can't use the new name for the exec node until it's completely drained.
[16:57:06] * Coren ponders.
[16:57:27] 6Labs, 10Beta-Cluster: Make sure labs ENC knows about cert changes - https://phabricator.wikimedia.org/T101124#1330344 (10Andrew) Bitrotted patch: https://gerrit.wikimedia.org/r/#/c/202790/
[16:57:32] First things first, lemme band-aid the hosts files so that the grid is back up.
[16:57:36] ok
[17:01:25] I'm still not getting why resolving atm randomly gets the old or new name.
[17:01:51] how are you testing it?
[17:02:08] YuviPanda: I'm not, I'm just speaking to what gridengine reports.
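A hedged guess at the shape of the hosts-file band-aid discussed above (the ~marc/hh file itself isn't shown in the log): entries on tools-master/tools-shadow pinning each node's IP back to the old-style name the exec nodes still self-report, so the master's reverse lookups keep matching. The IPs below are placeholders.

```
# /etc/hosts fragment (illustrative only; real IPs and the full host list differ)
10.68.17.11   tools-exec-1201.eqiad.wmflabs   tools-exec-1201
10.68.16.35   tools-submit.eqiad.wmflabs      tools-submit
```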
[17:02:21] ah, that was same as last week's outage then :)
[17:03:35] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs in /public/dumps - https://phabricator.wikimedia.org/T100885#1330403 (10Tobi_WMDE_SW)
[17:06:07] YuviPanda: Oooooh. There may be a bit of trickery we can use that's actually builtin to gridengine that I didn't know about.
[17:06:42] go on?
[17:06:50] Coren: did the hosts file hack fix gridengine for now?
[17:06:56] (seems to have)
[17:06:57] YuviPanda: Only partly.
[17:07:29] It's still randomly unreliable. But I think I found a better workaround.
[17:07:38] 10Tool-Labs, 3Labs-Sprint-100: Clean up huge logs on toollabs - https://phabricator.wikimedia.org/T98652#1330427 (10yuvipanda) cluebot's error log grew to 1.5T again, deleted and notifying maintainer (https://en.wikipedia.org/wiki/User_talk:Rich_Smith#Cluebot.27s_error_logs)
[17:07:42] * Coren reads docs, will emerge shortly.
[17:08:21] aha!
[17:10:09] chasemp: hey! no, the reason prod can't access labs is because there's a big firewall there explicitly forbidding all access
[17:10:31] we have punched some very specific holes (graphite.wmflabs.org and labsdb) but outside of that they're segregated
[17:10:41] chasemp: nothing to do with the NAT stuff
[17:10:47] right and that seems like a bigger question than this task is asking
[17:10:50] ok
[17:12:34] chasemp: yeah. lots of things gotta change to do that, I guess.
[17:12:53] YuviPanda: http://gridscheduler.sourceforge.net/htmlman/htmlman5/host_aliases.html
[17:13:14] hahahaa
[17:13:15] wow
[17:13:17] nice :)
[17:13:23] so that should fix the problem, I hope
[17:14:04] I've added an alias file with every exec and submit host.
[17:14:22] Yeay strace while trying to debug the issue.
[17:14:46] Which allowed me to see it trying to open that file and go "hmmm. 'host_aliases'. whu?"
[17:15:27] So if we see more issues, checking that file has the alias should be first step
[17:18:22] Ack. almost.
[17:21:53] YuviPanda: Still one issue with a partially munged queue (catscan). About to fix then we should be all set.
[17:22:53] Coren: ok. we should still figure out a long term solution, though. that might just be rebuilding instances slowly, *again*. (maybe this time we can drop the tools- prefix!)
[17:23:03] Coren: should also puppetize the aliases file, I think.
[17:23:54] YuviPanda: Probably. I'll wait until things are fully back up and we have a real plan.
[17:25:16] Coren: ok. should also file an incident report.
[17:38:28] Coren: back to being dead, btw
[17:38:36] 6Labs, 10Labs-Infrastructure: Labs webservice2 not working - https://phabricator.wikimedia.org/T101132#1330616 (10Magnus) 3NEW a:3yuvipanda
[17:39:06] YuviPanda: I am aware, trying to fix the damn catscan queue that was only partially converted. ima delete it entirely.
[17:39:17] Coren: ok
[17:39:29] 6Labs, 10Labs-Infrastructure: Labs webservice2 not working - https://phabricator.wikimedia.org/T101132#1330626 (10yuvipanda) Looks like there's a gridengine master outage again. We are looking into it.
[17:44:07] ... oh no.
[17:45:40] YuviPanda: You know that great solution that just fixed everything?
[17:46:32] YuviPanda: That great solution just ended up effing everything up. Badly.
[17:46:57] heh.
[17:47:01] what did it do now
[17:47:12] YuviPanda: So, earlier, I created a bunch of exec nodes so that we had the new names ready to switch.
[17:47:20] YuviPanda: So we had every instance twice.
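For reference, the host_aliases file Coren found above (man page linked at 17:12:53) is a plain text map, one host per line, with the primary name first and its aliases after it. The path and the entries below are assumptions sketched from that man page, not a copy of the real file on tools-master.

```
# /var/lib/gridengine/default/common/host_aliases (path assumed)
# primary_name                    alias(es)
tools-exec-1201.eqiad.wmflabs     tools-exec-1201.tools.eqiad.wmflabs
tools-submit.eqiad.wmflabs        tools-submit.tools.eqiad.wmflabs
```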
[17:47:27] by 'created a bunch of exec nodes' you mean in gridengine config, right?
[17:47:31] YuviPanda: (once under each name)
[17:47:34] Yes, in ge config
[17:47:51] YuviPanda: So now, with the alias file in place, when I deleted the bunch of new instances?
[17:48:08] YuviPanda: It has, in fact, deleted both. Because it thought them equivalent.
[17:48:27] * Coren headdesk.
[17:48:32] nice.
[17:48:47] so it thinks there are no exec nodes now?
[17:48:52] Well, in its defence, they _were_. explicitly.
[17:49:17] master is down now, I can't seem to do anything
[17:49:22] YuviPanda: In such a way that it is not even possible to actually start the master because the queues are broken.
[17:49:26] ah
[17:49:27] nice
[17:49:36] so
[17:49:39] what do we do now?
[17:49:48] I guess all running tasks, etc lost state?
[17:50:17] YuviPanda: Actually, possibly not. I'm looking at how to rebuild this now.
[17:50:33] YuviPanda: the exec nodes are still running and keeping state until the master is back up.
[17:50:53] ok
[17:51:17] That "simple" dns change ended up being a bit more troublesome than expected. :-/
[17:53:58] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1330711 (10yuvipanda) @Magnus do you mind if I just delete it and recreate it?
[18:01:04] Coren: I emailed labs-announce
[18:01:16] YuviPanda: ty. I'm busy pulling my hair out atm.
[18:02:05] alright. I'm staying away from it atm so we don't have two people poking at it at the same time :)
[18:03:52] YuviPanda: Yeah dbd database recovery by timestamp. :-)
[18:03:57] bdb*
[18:05:27] I still need to make sure everything is actually working right, but things seem back up. :-)
[18:08:57] YuviPanda: fwiw the gridengine host alias thing seems to have solved every problem in one fell swoop, if you ignore that temporary "omg I broke the world" bit. :-)
[18:12:51] :)
[18:13:18] Coren: when you think things have stabilized, can you 1. puppetize the host aliases, 2. write an incident report?
[18:13:32] is it possible to use htaccess files on tool labs?
[18:13:49] YuviPanda: Only fallout is that gridengine probably lost track of a couple of jobs that have started or ended in the troubled window (approx 30m) but it's unlikely that any job were started during.
[18:14:12] kaldari: No, that's an apache-specific thing. But most of what you might have wanted to do with it can be done with lighttpd (just differently)
[18:15:09] Coren: Thanks. Is there anything specific I need to know about setting it up, or should I just look up standard instructions for using lighttpd?
[18:15:13] YuviPanda: Yes to both, though I expect a couple hours at least. I'm still seeing a few wonky exec nodes and a couple odd things in the logs I want to look into.
[18:15:24] Coren: cool
[18:16:17] kaldari: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web has recipes for the most common things; but yeah, you can also just look up normal lighttpd config doc
[18:16:40] Look at 'Example configurations'
[18:28:12] YuviPanda: Do you know what's up with -webgrid-lighttpd-1406?
[18:28:25] hmm? is it doing anything?
[18:28:57] No, that being the exact issue. :-)
[18:30:34] YuviPanda: If you're not working on it, nor know of any issues, it probably just needs a reboot.
[18:30:56] Coren: alright. feel free to :)
[18:34:49] !log tools rebooting tools-webgrid-lighttpd-1406.eqiad.wmflabs
[18:34:52] Logged the message, Master
[18:37:21] I've created a new instance for my project (wikidata-query) and it rejects my public key. Old ones work fine.
Anybody knows what could be the problem?
[18:50:33] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1331014 (10Magnus) Go ahead please!
[19:18:12] labs-recursor0.wikimedia.org
[19:18:19] Recursive DNS - CRIT
[19:18:28] maybe known. dunno
[19:18:34] just see it in Icinga
[19:21:10] andrewbogott, ^
[19:21:47] I don’t know what that is, but I will look
[19:32:13] is there a generic phabricator bug opened for missing record in replicas or must I open a new one?
[19:44:13] phe: "Missing records"?
[19:48:15] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs in /public/dumps - https://phabricator.wikimedia.org/T100885#1331309 (10JanZerebecki) Is there some automated script in some git repo that puts either these dumps there? F...
[19:48:21] is there a role for labs I could use to install varnish?
[20:12:25] 6Labs, 10Beta-Cluster, 10Continuous-Integration-Infrastructure, 10Staging: Change local puppetmaster names for RE labs projects - https://phabricator.wikimedia.org/T101110#1331464 (10thcipriani) Staging, Deployment-prep, and Integration projects are all updated. No-op in all cases. $ RUBYLIB=/var/lib...
[20:12:48] 6Labs, 10Beta-Cluster, 10Continuous-Integration-Infrastructure, 10Staging: Change local puppetmaster names for RE labs projects - https://phabricator.wikimedia.org/T101110#1331465 (10thcipriani) 5Open>3Resolved
[20:13:50] PROBLEM - Puppet failure on tools-webgrid-generic-1403 is CRITICAL 60.00% of data above the critical threshold [0.0]
[20:17:25] Coren, missing row in tables, or perhaps index broken
[20:18:57] phe: Do you have a specific example in mind? I can look into it quickly and either confirm there is an issue or give you an explanation where there isn't. :-)
[20:19:13] commonswiki_p
[20:19:22] SELECT img_name, img_sha1 FROM image WHERE img_name like 'Rabier_-_F%';
[20:20:05] this File: is missing: https://commons.wikimedia.org/wiki/File:Rabier_-_Cr%C3%A9tinot,_1922.djvu
[20:20:54] anyway that's not the first time that sort of things occur, replicas on labs has never been reliable
[20:21:05] ... your query excludes it. :-) You're looking for 'Rabier_-_F%' and this is 'Rabier - C' not F. :-)
[20:21:05] this would never match?
[20:21:37] how right :D
[20:21:53] like 'Rabier_-_C%' returns it as expected. :-)
[20:38:51] RECOVERY - Puppet failure on tools-webgrid-generic-1403 is OK Less than 1.00% above the threshold [0.0]
[21:42:02] !log logstash Deleted all instances; will rebuild project after labs dns switch ~2015-06-11
[21:42:07] Logged the message, Master
[21:52:02] andrewbogott: I have an instance named shaved-yak.eqiad.wmflabs in the mediawiki-core-team project. `host shaved-yak.mediawiki-core-team.eqiad.wmflabs` is returning NXDOMAIN. This seems to be a blocker to me skipping ahead to step 4 in the dns migration process
[21:52:48] bd808: ok, let me find my email so I can see what step 4 is
[21:52:56] "On all puppet clients, edit /etc/puppet/puppet.conf and change the puppetmaster name by inserting the project name before .eqiad.wmflabs."
[21:52:58] I love how when you send an email to a mailing list gmail hides it forever
[21:53:10] yeah. gmail is fun like that
[21:53:32] mind if I log in and poke around?
[21:53:38] feel free
[21:53:54] my goofy test server is at your disposal :)
[21:54:16] your resolv.conf shows you as still using dnsmasq
[21:54:40] what project is this in?
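The step-4 instruction bd808 quotes above (insert the project name before .eqiad.wmflabs in /etc/puppet/puppet.conf) amounts to something like the following on each client; the instance and project names are taken from this example, and the sed pattern is a sketch rather than the documented procedure.

```
# Assuming puppet.conf currently has e.g. "server = shaved-yak.eqiad.wmflabs"
# in [agent]/[main]; rewrite it to the project-scoped form and re-run puppet.
sudo sed -i 's/^\(\s*server\s*=\s*shaved-yak\)\.eqiad\.wmflabs/\1.mediawiki-core-team.eqiad.wmflabs/' \
    /etc/puppet/puppet.conf
sudo puppet agent --test
```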
[21:54:40] when I try to run puppet I get "Error 400 on SERVER: stack level too deep"
[21:54:47] ah, ok
[21:54:59] if puppet can’t run then… nothing else will work. So let’s figure out about puppet :)
[21:55:08] The only change I made was removing use_dnsmasq=true
[21:55:14] it was running before that :(
[21:55:35] At least I think that's the only change
[21:55:48] did you update /var/lib/git/operations/puppet?
[21:56:16] yeah. the top upstream commit is 2 hours old
[21:56:28] 19bf816
[21:56:48] private too?
[21:56:53] Not sure if it matters, just being thorough
[21:56:57] probably not, no
[21:57:12] ok, I will...
[21:57:20] just did it
[21:57:40] caught some password change
[22:00:10] I restarted the puppetmaster and that seems to have fixed things.
[22:00:17] * andrewbogott updates the docs
[22:01:01] sweet. thanks
[22:02:10] can you take it from here? I’m curious to hear which other parts of my checklist are wrong
[22:02:22] yeah. I'll keep going
[22:02:40] I suppose I'll need to upde my instance proxies too right?
[22:02:45] *update
[22:03:03] it all depends on how they’re puppetized.
[22:03:15] I mean https://wikitech.wikimedia.org/wiki/Special:NovaProxy settings
[22:03:23] Oh, no, that shouldn't matter.
[22:03:36] The old fqdn will still resolve.
[22:03:47] excellent
[22:03:57] It’s just that $domain is different, so anything in puppet that refers to that gets confused.
[22:35:32] 6Labs, 10Labs-Infrastructure: Fix syslog error "nslcd[29117]: error writing to client: Broken pipe" - https://phabricator.wikimedia.org/T78616#1332156 (10Gage) There's also a Debian bug report discussing this: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=685504 They propose creating a dummy user in /etc/...
[22:36:16] bd808: any luck?
[22:36:27] It seems to all be ok
[22:36:39] really?
[22:36:56] that was a degenerate test really. it was single node host that happened to use a fqdn to reference itself
[22:37:08] Ah, ok. Well, one more edge case survived :)
[22:37:10] Thanks.
[22:37:39] deployment-prep will be the hardest migration I think
[22:40:05] yeah, I’ll probably schedule time with Hashar to do that.
[22:40:38] I think thcipriani would be the guy to talk to
[22:40:57] I think hashar has gladly forgotten how to manage the cluster ;)
[22:41:27] 6Labs, 6WMF-Legal: Request to review privacy policy and rules - https://phabricator.wikimedia.org/T97844#1332170 (10d33tah) >>! In T97844#1273445, @ZhouZ wrote: > Hi @d33tah, thank you for your contributions on WikiLabs. The WMF legal team has reviewed your issue and has some thoughts in regards to your quest...
[22:43:30] andrewbogott: I'd be down to do deployment-prep
[22:43:50] I don't think it'll be too too bad :)
[22:43:52] thcipriani: ok, that might make more sense since we’re in the same timezone.
[22:44:06] We can do it any day this week, just ping me when you have a few hours.
[22:44:12] But not right now ‘cause I’m about to leave
[22:44:28] heh, sure. I'll get with you tomorrow to setup some time.
[22:45:05] great!
[23:25:00] is there a good way to get commons image urls that doesn't involve scraping file pages?
[23:25:50] Like, using some combination of the API, the file name, and the page_id?
[23:30:28] aha!
[23:30:37] iiprop=url
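To spell out the iiprop=url answer at the end: the imageinfo API returns the direct file URL for a given title (or pageids=...), so no page scraping is needed. The example title below is the one from the earlier replica discussion; any File: title works.

```
# Fetch the direct URL for a Commons file via the API; the URL lands at
# .query.pages.<pageid>.imageinfo[0].url in the JSON response.
curl -s 'https://commons.wikimedia.org/w/api.php?action=query&format=json&prop=imageinfo&iiprop=url&titles=File:Rabier_-_Cr%C3%A9tinot,_1922.djvu'
```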