[01:07:27] PROBLEM - Puppet failure on tools-exec-1201 is CRITICAL 30.00% of data above the critical threshold [0.0]
[01:32:22] RECOVERY - Puppet failure on tools-exec-1201 is OK Less than 1.00% above the threshold [0.0]
[05:49:35] 10Quarry: Add list of query executions to the query page side-bar - https://phabricator.wikimedia.org/T100982#1328147 (10Abarcomb) This request arose out of a discussion at a workshop. In my drawing, I saw modifications of the same query as branches, while completely new queries formed new top-level nodes. Of c...
[05:57:04] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1404 is CRITICAL 22.22% of data above the critical threshold [0.0]
[06:27:01] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1404 is OK Less than 1.00% above the threshold [0.0]
[06:43:08] PROBLEM - Puppet failure on tools-submit is CRITICAL 22.22% of data above the critical threshold [0.0]
[07:13:10] RECOVERY - Puppet failure on tools-submit is OK Less than 1.00% above the threshold [0.0]
[08:44:10] 10Tool-Labs-tools-Other, 10Analytics: Work on Metrics tools wm-metrics and MediaCollectionDB, refactoring and code quality. - https://phabricator.wikimedia.org/T100710#1328387 (10Qgil)
[08:44:21] 10Tool-Labs-tools-Other, 10Analytics: Work on Metrics tools wm-metrics and MediaCollectionDB, refactoring and code quality. - https://phabricator.wikimedia.org/T100710#1328388 (10Qgil) a:3JeanFred
[08:56:29] 6Labs, 10Labs-Infrastructure, 6operations, 3Labs-Sprint-100: Make a block-level copy of the codfw mirror of labstore1001 to eqiad - https://phabricator.wikimedia.org/T101010#1328391 (10mark) (Sparse) block level copying of the thin volumes started between the systems: ``` pv -eprab /dev/mapper/store-now_...
[09:46:04] 6Labs, 5Patch-For-Review: Get Labs openstack service dbs on a proper db server - https://phabricator.wikimedia.org/T92693#1328461 (10jcrespo) @Springle Checking with you, because you were restructuring the backups/dbstore1: https://gerrit.wikimedia.org/r/#/c/214994/ https://gerrit.wikimedia.org/r/#/c/214993/
[12:27:13] 6Labs: Clean up, delete 'openmeeting' project - https://phabricator.wikimedia.org/T101039#1328646 (10Matanya)
[12:27:46] 6Labs: Clean up, delete 'openmeeting' project - https://phabricator.wikimedia.org/T101039#1327280 (10Matanya) Please don't delete video project, turns out the jessie does serve my needs. Thanks!
[12:36:10] YuviPanda: Krenair: Can I get a +1 for https://gerrit.wikimedia.org/r/#/c/215317/ ?
[12:37:03] we've presumably tested a few instances without use_dnsmasq, right?
[12:37:28] Krenair: yes — my bastion has been running without it for a month...
[12:37:39] I just now turned it off for all the other bastions.
[12:38:37] done
[12:38:49] yup
[12:39:39] I've always thought it was kind of strange that I could look up wmnet hosts from labs
[12:39:46] But I guess since it's all in the public operations/dns repo...
[12:40:17] Krenair: yeah, and I think you still can after this change. It’s weird but harmless.
[12:40:23] yep
[12:40:27] Actually, I’ll make a task to disable it; it’s easy.
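For readers unfamiliar with the procedure mark references in T101010 above: a sparse block-level copy of a thin volume, with pv for progress reporting, generally looks something like the sketch below. The device names, target host, and dd options are placeholders (the actual command is truncated in the log), so this is an illustration rather than the invocation that was run.

```
# Hypothetical sketch of a sparse block-level copy of a thin LV to a remote
# host; device names, host, and block size are assumptions.
pv -eprab /dev/mapper/store-now_snap \
  | ssh root@labstore1001.eqiad.wmnet \
      'dd of=/dev/mapper/store-now bs=16M conv=sparse'
```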
[12:41:26] 6Labs, 10Labs-Infrastructure: Don't forward .wmnet requests from the labs resolver - https://phabricator.wikimedia.org/T101090#1328687 (10Andrew) 3NEW a:3Andrew
[12:41:48] 6Labs, 10Labs-Infrastructure: Don't forward .wmnet requests from the labs resolver - https://phabricator.wikimedia.org/T101090#1328696 (10Andrew)
[12:42:51] 6Labs, 10Labs-Infrastructure: Don't forward .wmnet requests from the labs resolver - https://phabricator.wikimedia.org/T101090#1328701 (10yuvipanda) Note that we want to resolve at least some things - labmon1001.eqiad.wmnet, labsdb100*.eqiad.wmnet, and I'm sure there'll be more things in the future.
[12:45:36] 6Labs, 10Labs-Infrastructure: Don't forward .wmnet requests from the labs resolver - https://phabricator.wikimedia.org/T101090#1328705 (10Andrew) really? It's useful to resolve those IPs even though we can't route to them?
[12:47:07] 6Labs, 10Labs-Infrastructure: Don't forward .wmnet requests from the labs resolver - https://phabricator.wikimedia.org/T101090#1328714 (10yuvipanda) You can route to them! We have several holes in that firewall.
[12:47:49] 6Labs, 10Labs-Infrastructure: Don't forward .wmnet requests from the labs resolver - https://phabricator.wikimedia.org/T101090#1328716 (10Andrew) 5Open>3declined ok, should just leave this as is then. Easy!
[12:48:13] 10Tool-Labs, 3Labs-Sprint-100, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Clean up local crontabs - https://phabricator.wikimedia.org/T96472#1328719 (10yuvipanda) This is done, and I moved the following tools' crontabs to tools-submit with appropriate jsubs: # reportsbot # revibot # revibot-i # revibot-ii # r...
[12:48:31] andrewbogott: :D
[12:48:34] heh, ok
[12:51:18] 10Tool-Labs, 3Labs-Sprint-100, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Clean up local crontabs - https://phabricator.wikimedia.org/T96472#1328729 (10yuvipanda) 5Open>3Resolved a:3yuvipanda
[12:52:00] 6Labs, 10Tool-Labs, 3Labs-Sprint-100: Move tools to designate - https://phabricator.wikimedia.org/T96641#1328739 (10yuvipanda)
[12:52:02] 10Tool-Labs: Move toollabs to designate - https://phabricator.wikimedia.org/T100023#1328740 (10yuvipanda)
[12:52:36] 6Labs, 10Tool-Labs: Move tools to designate - https://phabricator.wikimedia.org/T96641#1223096 (10yuvipanda)
[12:53:08] 10Tool-Labs, 3Labs-Sprint-100: Move toollabs to designate - https://phabricator.wikimedia.org/T100023#1303692 (10yuvipanda)
[12:53:15] 10Tool-Labs, 3Labs-Sprint-100: Move toollabs to designate - https://phabricator.wikimedia.org/T100023#1303692 (10yuvipanda) Bastions done.
[12:54:14] 10Tool-Labs: Get rid of tools-trusty bastion - https://phabricator.wikimedia.org/T101094#1328758 (10yuvipanda) 3NEW
[12:56:27] !log tools switched labs webproxies to designate, forcing puppet run and restarting nscd
[12:56:30] Logged the message, Master
[12:56:45] andrewbogott: do you think we should / could stress test the new DNS resolvers?
[12:57:10] maybe, although I don’t have a good idea as to how.
[13:00:20] andrewbogott: google lists quite a few of 'em
[13:00:39] that would be a script that we run from w/in labs?
[13:00:53] YuviPanda: did legoktm poke you about the error message received at https://tools.wmflabs.org/checker ?
[13:02:01] andrewbogott: yea. like https://github.com/jedisct1/dnsblast or http://pentestmonkey.net/tools/misc/dns-grind
[13:02:05] and that there is a redirect error at https://tools.wmflabs.org/checker/
[13:02:08] sDrewth: vaguely. can you file a task on phab?
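For anyone wanting to reproduce the stress test discussed above without dnsblast or dns-grind, a crude approximation can be run from any Labs instance with dig and xargs; the resolver name, query names, and concurrency below are assumptions chosen for illustration.

```
# Fire a few thousand parallel A-record queries at the new recursor and watch
# for timeouts/SERVFAILs; target and hostnames are placeholders.
RESOLVER=labs-recursor0.wikimedia.org
seq 1 5000 | xargs -P 50 -I NUM \
    dig +short +time=1 +tries=1 @"$RESOLVER" "host-NUM.tools.eqiad.wmflabs" A
```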
[13:02:14] k
[13:02:26] YuviPanda: ok, sure, would be good to get some stats comparing the old and new. Want to try?
[13:02:34] andrewbogott: yeah let me do
[13:02:59] 10Tool-Labs, 3Labs-Sprint-100: Move toollabs to designate - https://phabricator.wikimedia.org/T100023#1328781 (10yuvipanda) And webproxies done! Doing a stress test now.
[13:03:48] andrewbogott: labs-recursor01.wikimedia.org right?
[13:03:56] yep
[13:04:15] andrewbogott: hmm
[13:04:20] yuvipanda@tools-bastion-02:~/dnsblast$ ping labs-recursor01.wikimedia.org
[13:04:23] ping: unknown host labs-recursor01.wikimedia.org
[13:04:23] wat
[13:05:34] clearly I don’t remember the name
[13:05:36] * andrewbogott looks
[13:05:47] labs-recursor0
[13:05:58] which host is it on, btw?
[13:06:06] holmium
[13:06:15] but the recursor has its own ip
[13:06:27] if you hit holmium directly you’ll be hitting a different dns server
[13:10:14] 10Tool-Labs: Tool checker throwing errors following labs changes - https://phabricator.wikimedia.org/T101097#1328816 (10Billinghurst)
[13:11:53] YuviPanda: afaict, the thing that prevented the shadow from starting its master is the same thing that broke the master to begin with. I'm about to forcibly try a failover by killing master, unless you desperately need it working atm.
[13:12:00] 10Tool-Labs: Tool checker throwing errors following labs changes - https://phabricator.wikimedia.org/T101097#1328807 (10Billinghurst) @mzmcbride adding to ticket
[13:15:06] 10Tool-Labs: Tool checker throwing errors following labs changes - https://phabricator.wikimedia.org/T101097#1328831 (10yuvipanda) When did this stop working?
[13:15:32] Coren: yeah, go for it
[13:16:17] 10Tool-Labs: Tool checker throwing errors following labs changes - https://phabricator.wikimedia.org/T101097#1328834 (10Billinghurst) Looks to be by 15 May [1] [1] https://en.wikisource.org/wiki/Wikisource:Scriptorium/Help#Page_Checker_gadget_-_page_is_not_directing_properly_._._._again
[13:17:37] Coren: can you !log your restarts / other master related actions here? thanks.
[13:17:54] !log tools killing the sge_qmaster to test failover
[13:17:57] Logged the message, Master
[13:23:22] !log tools stracing the shadowd to see what's up; master is down as expected.
[13:23:25] Logged the message, Master
[13:23:49] Tests made annoying because the damn thing does hardcoded sleep(60). *grumble*
[13:25:10] catchpoint caught this outage as well
[13:25:10] * YuviPanda feels quite happy about that
[13:25:10] should get you guys accounts when chase is back online
[13:25:11] am I missing an email from catchpoint?
[13:25:28] Krenair: this doesn't go to ops@ yet
[13:25:30] or is this not something that goes to ops?
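A rough sketch of the failover exercise Coren !logs above, assuming the stock Debian gridengine layout on tools-master/tools-shadow (the paths are an assumption, not confirmed from the hosts themselves):

```
# On the current master: check which host gridengine considers active, then
# kill the qmaster and let sge_shadowd on tools-shadow promote itself.
cat /var/lib/gridengine/default/common/act_qmaster
sudo pkill -f sge_qmaster
# Poll until the active-master file flips to the shadow's name (the log
# notes this took two 60s timeouts rather than the expected 30s).
watch -n 10 cat /var/lib/gridengine/default/common/act_qmaster
```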
[13:25:30] ok
[13:25:34] we haven't quite figured out where to make it go to
[13:25:37] this is toollabs specific
[13:25:39] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs - https://phabricator.wikimedia.org/T100885#1328863 (10Tobi_WMDE_SW)
[13:26:15] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs - https://phabricator.wikimedia.org/T100885#1328865 (10yuvipanda) +1 would be useful :)
[13:27:50] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs - https://phabricator.wikimedia.org/T100885#1328871 (10Hydriz) The dumps have always existed under `/data/scratch/wikidata` and `/data/scratch/wikibase` for people to use....
[13:28:15] !log tools sge_shadowd started a new master as expected, after /two/ timeouts of 60s (unexpected)
[13:28:18] Logged the message, Master
[13:28:42] Hm.
[13:29:02] So it works, but takes no less than 2 minutes despite the current settings which should be 30s.
[13:29:20] 10Tool-Labs: Tool checker throwing errors following labs changes - https://phabricator.wikimedia.org/T101097#1328877 (10MZMcBride) Thanks for filing this, @Billinghurst. Over the weekend, I tried all sorts of commands to try to revive the tool, but none of them were successful. I also looked at various logs in...
[13:29:21] Ah, and there are... DNS issues?
[13:30:03] server host resolves rdata host "tools-bastion-01.tools.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)"
[13:31:04] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs - https://phabricator.wikimedia.org/T100885#1328882 (10Addshore)
[13:31:11] That's odd. Resolves properly from bastions, but not tools-shadow
[13:31:51] ohwait. Why is the resolv.conf different on those hosts?
[13:32:04] andrewbogott: ^^ part of your current wip?
[13:33:20] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs - https://phabricator.wikimedia.org/T100885#1328891 (10Addshore) @Hydriz Ahh, it's awesome that they exist there! Yes it would be great if they were moved!
[13:33:35] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs in /public/dumps - https://phabricator.wikimedia.org/T100885#1328892 (10Addshore)
[13:33:46] Coren: the bug the last time was that /etc/hosts was too big and it couldn't quite read it
[13:34:24] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs in /public/dumps - https://phabricator.wikimedia.org/T100885#1322765 (10Addshore)
[13:34:31] oh, right. bastions were switched earlier today, I wonder if that's causing them to send in different hostname info?
[13:34:34] Yeah, right now the issue seems to be different; it looks like the dns server the shadow resolv.conf points to really can't resolve.
[13:35:34] YuviPanda: It does. I just noticed - the new DNS gives the full new fqdn which would be okay if the master and shadow could resolve it.
[13:35:45] Coren: we could switch the master / shadows too
[13:35:51] was going to do that later anyway
[13:35:56] should I switch now?
[13:36:12] Hm. Lemme check to make sure what the impact of the names changing would be.
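A quick way to reproduce the resolution discrepancy being debugged above is to compare what each host's configured resolver returns for the old and new name forms; the hostnames here are examples only.

```
# Which resolver is this instance using, and can it resolve both name forms?
cat /etc/resolv.conf
getent hosts tools-bastion-01.eqiad.wmflabs          # old-style name
getent hosts tools-bastion-01.tools.eqiad.wmflabs    # new project-scoped name
# Query the configured nameserver directly, bypassing nscd and /etc/hosts.
dig +short tools-bastion-01.tools.eqiad.wmflabs \
    @"$(awk '/^nameserver/ {print $2; exit}' /etc/resolv.conf)"
```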
[13:36:23] ok
[13:36:25] May need to update the gridengine config first.
[13:36:28] Coren: note that the old names will still resolve
[13:36:50] Yeah, but the clients still identify themselves with the new names. That should only affect admin hosts though.
[13:37:15] (exec hosts don't care what their names are so long as the master can reach them)
[13:38:06] But also, this is a change from today so really shouldn't affect why you didn't get failover last week.
[13:38:15] Coren: (probably you’ve caught up with this already) dnsmasq-using hosts can’t resolve name.project.eqiad.wmflabs; that’s strictly a feature of the new setup.
[13:38:43] andrewbogott: I know; this is clearly what's going on now.
[13:39:31] YuviPanda: As expected, the switchover works fine from tools-submit which didn't get the new dns.
[13:39:40] right
[13:39:53] YuviPanda: So if I add the new names to the config and we switchover the masters things should work.
[13:39:56] * Coren does that now.
[13:40:01] ok
[13:42:13] Oy vey. Circular dependency. Can't add a host unless it resolves, can't switch resolver until hosts are added.
[13:42:19] * Coren duct tapes over the issue.
[13:46:45] Yeah, that switchover is going to be "fun". Heh. I have enough duct tape applied, we need to switch master and shadow to the new scheme now.
[13:47:18] Then I'll be able to finish the config changes - I manually added tools-bastion-01.tools.eqiad.wmflabs as admin host.
[13:47:21] Coren: want me to do the dns switchover?
[13:47:39] YuviPanda: Please. I'll +2 your patch since you already know exactly what to do.
[13:48:17] Coren: nah, config change in wikitech. already done, you can force a puppet run to get new resolv.conf
[13:48:32] Thankfully none of this should affect running jobs.
[13:49:22] PROBLEM - Puppet failure on tools-redis is CRITICAL 100.00% of data above the critical threshold [0.0]
[13:49:38] Annoying, but the switch to the new names is a big win and worth the trouble.
[13:49:55] the old names still work tho
[13:50:23] The issue isn't that the old names don't work but that the clients identify themselves with the new one.
[13:50:40] ah, hmm
[13:50:49] Ah. And now the tools-bastion-01 can talk to the current master properly.
[13:51:03] So I am now able to add the new names.
[13:52:27] !log tools new-style names for gridengine admin hosts added
[13:52:30] Logged the message, Master
[13:53:09] !log tools moved tools-master / shadow to designate
[13:53:12] Logged the message, Master
[13:54:22] !log tools adding new-style names for submit hosts
[13:54:24] Logged the message, Master
[13:55:00] Coren: can you list the submit hosts? I should switch them over to designate too
[13:56:29] YuviPanda: qconf -ss gives you the list, but wait until I'm done adding the new ones first.
[13:56:36] Coren: ah ok
[13:57:18] that's a lot of hosts
[13:57:24] it might just be easier to turn it on toolswide
[13:57:32] Probably.
[13:58:05] I'll wait till you've added all the new ones, I guess
[13:58:09] That list actually needs cleanup too; once the switch is done. Right now, I'm keeping both series of names.
[13:58:34] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1328951 (10Magnus) 3NEW
[14:00:00] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1328972 (10yuvipanda) I just rebooted the host, let's see if that fixes it.
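For reference, the qconf side of what is being done above looks roughly like this; the hostnames are examples rather than the full list that was actually added.

```
qconf -ss                                        # list current submit hosts
qconf -sh                                        # list current admin hosts
qconf -ah tools-bastion-01.tools.eqiad.wmflabs   # add a new-style name as admin host
qconf -as tools-submit.tools.eqiad.wmflabs       # add a new-style name as submit host
```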
[14:03:30] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1328982 (10Magnus) Doesn't look like it: magnus@tools-bastion-01:~$ ssh wdq-mm-02.eqiad.wmflabs ssh: connect to host wdq-mm-02.eqiad.wmflabs port 22: Connection refused
[14:03:36] YuviPanda: I need to test the impact on exec nodes before we proceed further.
[14:03:46] * Coren designates tools-exec-catscan as the victim.
[14:03:53] ok
[14:04:23] Coren: empty the variable use_dnsmasq in the configure page for the tool and force a puppet run and it should switch
[14:06:41] Gotcha.
[14:07:51] Thankfully, gridengine never relies on reverse dns
[14:19:52] andrewbogott: hmm, something looks off with wdq-mm-02
[14:20:07] andrewbogott: as in a reboot doesn't seem to have actually rebooted it.
[14:20:10] let me try again, actually
[14:20:44] Coren: how's it going?
[14:21:06] YuviPanda: Not sure yet. Still in the middle of prelim testing.
[14:21:38] ok
[14:22:30] what's killing wikibugs?
[14:27:07] 10Tool-Labs: Tool checker throwing errors following labs changes - https://phabricator.wikimedia.org/T101097#1329554 (10yuvipanda) I see an `@app.route('')` in the code that is causing: ```Traceback (most recent call last): File "/data/project/checker/www/python/src/app.py", line 139, in @app.rou...
[14:27:37] bblack: > UnicodeEncodeError: 'ascii' codec can't encode character '\xe5' in position 231: ordinal not in range(128)
[14:27:42] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1329557 (10Magnus) I just restarted the wdq-mm service on wdq-mm-01.eqiad.wmflabs; while it was stopped, I couldn't get the web site, which probably means wdq-mm-02.eqiad.wmflabs is not serving. That works for now, but I'd...
[14:28:05] heh
[14:28:19] so it's because someone put a non-ascii character in a commit title, basically?
[14:28:26] possibly. I'm not sure
[14:28:32] I think it's handled those well before
[14:30:03] andrewbogott: I might be late to the party, but the DNS change is breaking SGE
[14:30:07] "error: commlib error: access denied (client IP resolved to host name "tools-submit.tools.eqiad.wmflabs". This is not identical to clients host name "tools-submit.eqiad.wmflabs")"
[14:30:22] valhallasw: I think Coren and YuviPanda are working on that
[14:30:29] ok, great!
[14:30:59] Coren: did you add new name for tools-submit?
[14:31:36] YuviPanda: I have. I'm trying to figure out why -submit isn't behaving like the bastions.
[14:31:44] ok
[14:32:18] Oh, ew. It's gotta be all-or-nothing.
[14:32:42] * Coren curses.
[14:33:02] YuviPanda: btw, can you please revive virt1005 sometime soon? It doesn’t make much of a lifeboat at the moment.
[14:33:06] YuviPanda: I think we better rip the band-aid off and deal with the fallout.
[14:33:36] YuviPanda: moritz is making some security changes and… it would be nice to have a test box
[14:33:38] Coren: alright, let me turn it on everywhere.
[14:34:35] !log tools turned off dnsmasq for toollabs
[14:34:36] Coren: ^
[14:34:38] Logged the message, Master
[14:34:48] * Coren forces a puppet run on -submit
[14:40:08] Coren: that seems to have fixed it
[14:40:44] Yeah, that's expected. I'm trying to see how things fare for the exec nodes now.
[14:40:54] ok
[14:44:30] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1329624 (10yuvipanda) The node looks kind of dead :| I'll ping @andrew to investigate - if not we can always rebuild it!
[14:45:12] afaict, everything is working fine atm.
[14:45:32] Hm.
I should say "seems to be"
[14:45:49] So as to not look too foolish if things explode later. :-)
[14:46:02] heh
[14:46:11] Coren: should we make them all use their new hostnames?
[14:46:18] I didn't expect the hostnames themselves to change :|
[14:46:26] hostname -f gives the newer project scoped name
[14:46:49] hashar, Krinkle: I’m writing down an upgrade path for projects that use self-hosted puppet with clients. one step involves changing a line in /etc/puppet/puppet.conf by hand on each client. Is that acceptable? (Of course you can use sed and/or salt depending on your setup and preferences)
[14:52:45] 10Tool-Labs: Tool checker throwing errors following labs changes - https://phabricator.wikimedia.org/T101097#1329659 (10yuvipanda) ^ was on doing a webservice uwsgi-python start, which is what @legoktm told me this was running on.
[14:54:06] Oh, hell.
[14:54:20] 10Tool-Labs-tools-Other: Tool checker throwing errors - https://phabricator.wikimedia.org/T101097#1329664 (10yuvipanda)
[14:54:55] YuviPanda: The exec node names changing means a restart of the shepherds break. Means every host needs to be readded and queue lists changed.
[14:55:06] * Coren does that now.
[14:55:33] Coren: so can we force a puppet run, and then just add the newly generated file definitions for exec nodes that puppet puts out?
[14:56:19] Hm no, we don't want to actually _add_ those hosts to the queue as this would do. I need to add, manually edit to queues to rename, then delete.
[14:56:31] There is no support for renaming.
[14:56:37] Coren: oh, because then they'd kind of forget what jobs they're currently running?
[14:56:52] (As this is insanely baroque and unusual)
[14:57:34] YuviPanda: I don't know - I only tried (on purpose) on an empty exec node. I'm pretty sure it'd be all bad.
[14:57:46] yeah, I won't be surprised
[14:58:03] Coren: another option is to make the output of hostname -f go back to what it used to be
[14:58:06] But since the change only takes effect on restarting the shepherd, that means the node is restarted anyways.
[14:58:08] which is without the project prefix
[14:58:17] this we can do with an /etc/hosts entry
[14:58:50] I've already fixed everything but exec nodes, and we'd just be setting ourselves up for a bigger fail down the line I think.
[14:59:00] Better suffer a bit now than set us up a bomb.
[14:59:03] yeah, I agree
[14:59:32] And yes, that's a ayb reference. It's old enough that it's retro cool by now. :-)
[15:06:41] Coren: hah :)
[15:06:50] Coren: keep !logging and let me know if there's something I can do to help
[15:08:49] I'm still trying to figure out the least painful way forward.
[15:09:05] ok
[15:21:34] PROBLEM - Puppet failure on tools-exec-1410 is CRITICAL 100.00% of data above the critical threshold [0.0]
[15:25:14] Hm. It might be possible to trick the master.
[15:27:34] Coren: how so?
[15:27:35] 10Tool-Labs, 10pywikibot-core, 5Patch-For-Review: Support Debian package python-ipaddr - https://phabricator.wikimedia.org/T100603#1329806 (10jayvdb) >>! In T100603#1316531, @jayvdb wrote: > Travis whitelist request posted: https://github.com/travis-ci/travis-ci/issues/3973 no response yet.
[15:28:27] YuviPanda: I'm testing now, but it /looks/ like an unrestarted shepherd will work with the new name.
[15:28:40] So we can rename hosts without having to restart them
[15:28:48] Coren: by 'shepherd' do you mean the gridengine-exec process?
[15:28:52] * Coren nods.
[15:29:27] Coren: does restarting them actually cause gridengine to lose state info?
[15:30:07] You mean while there are jobs?
[15:31:47] yeah
[15:32:15] I expect it would because the sge_execd is actually the parent of the sge_shepherd processes. Alternately, the execd might just kill its children (which would also make sense).
[15:33:35] I don't think we have a better solution than disable-evacuate-"rename"-enable
[15:34:08] Coren: ugh
[15:34:17] Ima gonna get started on that then.
[15:34:31] Coren: we usually have straggler processes that take days to drain
[15:34:53] Hm.
[15:34:54] Coren: we could still do the /etc/hosts 'hack'.
[15:35:03] Coren: and then we have more breathing room to do the move arounds.
[15:35:21] the hack is also trivially puppetizable
[15:35:22] 6Labs, 10Beta-Cluster, 10Continuous-Integration-Infrastructure, 10Staging: Change local puppetmaster names for RE labs projects - https://phabricator.wikimedia.org/T101110#1329841 (10Andrew) 3NEW a:3thcipriani
[15:35:53] I'm not sure how that'd help. The actual amount of work is exactly the same, we're just introducing more confusion for longer.
[15:36:14] * Coren ponders.
[15:36:42] Alternately, I can add all the "new" hosts
[15:36:54] But leave them disabled.
[15:37:09] As we evacuate and restart, they'll be picked up automatically.
[15:37:10] restarting all jobs is somewhat disruptive and we shouldn't do that if possible, I think
[15:37:36] So we can still spread things over a longer period without introducing a hack.
[15:38:16] 6Labs, 10Beta-Cluster, 10Continuous-Integration-Infrastructure, 10Staging: Change local puppetmaster names for RE labs projects - https://phabricator.wikimedia.org/T101110#1329862 (10thcipriani) p:5Triage>3High
[15:38:28] So, we evacuate a fraction, and wait until they drain
[15:38:37] however long that takes.
[15:38:50] Any exec node that is not restarted will keep on working.
[15:39:09] took us more than a week last time
[15:39:20] Sure, but nothing would break in the meantime.
[15:42:58] Either way, need to add the "new" nodes.
[15:43:24] !log tools adding the "new" exec nodes (aka, current nodes with new names)
[15:43:28] Logged the message, Master
[15:46:39] andrewbogott: yeah we can do the manual step for sure.
[15:46:53] andrewbogott: there is even a salt master for beta and integration labs project. Not sure how well it work though
[15:47:00] hashar: great. I’ll send an email or something before I break anything.
[15:49:45] 6Labs, 10Beta-Cluster, 10Continuous-Integration-Infrastructure, 10Staging: Change local puppetmaster names for RE labs projects - https://phabricator.wikimedia.org/T101110#1329841 (10mmodell) this task is binary.
[15:53:56] Ah, 1401 only has a single, restartable job on it. Using it as guinea pig
[15:54:15] !log tools Switching names for tools-exec-1401
[15:54:18] Logged the message, Master
[16:04:21] I'm still getting SGE errors on submission
[16:04:24] https://tools.wmflabs.org/gerrit-reviewer-bot/
[16:04:28] (at the bottom)
[16:04:34] valhallasw: I am aware. Working on it.
[16:06:38] RECOVERY - Puppet failure on tools-exec-1410 is OK Less than 1.00% above the threshold [0.0]
[16:08:28] .. there seems to be something intermittently odd about dns.
[16:10:34] on tools-submit? that's weird
[16:10:53] !log tools restart nscd on tools-submit
[16:10:57] Logged the message, Master
[16:11:04] note that it's talking to -shadow, not to -master
[16:11:10] oh right
[16:11:18] yes that's current master
[16:11:28] hm, okay.
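One plausible shape of the "add the new names but leave them disabled" approach described above, sketched with standard gridengine commands; the queue name and hostname are assumptions and this may not match exactly what was run.

```
qconf -ae          # add an execution host: in the editor, set the hostname to
                   # tools-exec-1401.tools.eqiad.wmflabs (name assumed)
qmod -d 'task@tools-exec-1401.tools.eqiad.wmflabs'   # keep its queue instance disabled ("task" queue assumed)
# Re-enable once the old-name node has been drained and removed:
qmod -e 'task@tools-exec-1401.tools.eqiad.wmflabs'
```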
[16:11:55] valhallasw: that was part of https://phabricator.wikimedia.org/T90546
[16:12:23] I see
[16:18:16] valhallasw: how does https://etherpad.wikimedia.org/p/bigbrother-almost-no-more sound?
[16:18:28] valhallasw: I should probably add a note about not touching the service.manifests
[16:19:11] Gah.
[16:19:28] 6Labs, 10Beta-Cluster, 10Continuous-Integration-Infrastructure, 10Staging: Change local puppetmaster names for RE labs projects - https://phabricator.wikimedia.org/T101110#1330124 (10scfc) https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster#Puppetmaster_Setup explicitly says: > […] When you...
[16:20:11] I've no idea wtf is going on. Resolution seems to be random.
[16:20:36] !log tools switching back to tools-master
[16:20:39] Logged the message, Master
[16:26:46] Coren: probably related to the huge /etc/hosts file still on tools-shadow - same issues as last time, I guess
[16:26:47] * YuviPanda purges it
[16:26:52] Coren: tools-master should be fine
[16:27:10] !log tools cleaned out /etc/hosts file on tools-shadow
[16:27:13] Logged the message, Master
[16:30:41] YuviPanda: tools-master has the same issue, seemingly.
[16:31:04] oh wow
[16:31:13] so we're back to same spot as last week's outage then...
[16:31:35] yuppp
[16:32:07] And so, in the end, unrelated to /etc/hosts
[16:32:12] yes
[16:32:23] well, *maybe*
[16:32:31] * Coren attempts to debug
[16:33:07] Hm. The DNS inconsistency is really playing hell with gridengine.
[16:34:29] Coren: I still think we should put in a /etc/hosts hack that should help with hostname -f output
[16:34:31] It almost looks as though the resolver gets random names for the same IP
[16:34:41] and then we don't actually have to touch anything for a while?
[16:34:47] unless I'm misunderstanding the problem
[16:35:31] YuviPanda: It's likely to just make matters worse; now you'll have /etc/hosts contradicting dns and the resolver is likely to get no better for it.
[16:35:55] The issue predates the dns switch for the bastions, we're just seeing it more.
[16:35:59] Coren: it shouldn't contradict dns, no? like, hitting tools-master.eqiad.wmflabs will also return the correct IP
[16:36:28] yes, but reversing the ip won't match the forward host name and while the /clients/ don't care, the master checks.
[16:37:24] But right now, the master is getting inconsistent values. I'm trying to track down how exactly.
[16:37:29] YuviPanda: edited email a bit
[16:37:30] ok
[16:37:37] valhallasw: looking
[16:43:27] YuviPanda: We may have to cheat after all; gridengine *really* doesn't like hostname changing on the exec nodes where jobs are running. Either we have to restart everything pretty much at once or we have to cheat the names with a hosts hack, and restart everything with intermediate changes to /etc/hosts.
[16:43:56] 6Labs, 10Beta-Cluster: Make sure labs ENC knows about cert changes - https://phabricator.wikimedia.org/T101124#1330255 (10Andrew) 3NEW a:3Andrew
[16:44:25] Guh. That thing was *really* not meant to have its host names switched.
[16:44:35] YuviPanda: there was some ticket that alex worked on about nat and labs vm's
[16:44:45] does that relate to https://phabricator.wikimedia.org/T95714 do you know?
[16:45:01] Coren: /etc/hosts hack gets my vote
[16:45:37] YuviPanda: It's only a partial band-aid; we'll also have to do the whole thing by hand anyways. But it'll get things back to running for now at least.
[16:45:50] Coren: what do you mean by 'whole thing by hand'?
[16:46:27] evacuate-dequeue-change entry in hosts file-delete old node-requeue. For every exec node.
[16:46:36] 10Tool-Labs-tools-Other: Tool checker throwing errors - https://phabricator.wikimedia.org/T101097#1330278 (10Legoktm) 5Open>3Resolved a:3Legoktm So...I added the `@app.route('')` while debugging and forgot to remove it :/ Somehow the webserver was started as lighttpd instead of uwsgi-python, and something...
[16:46:54] * Coren ponders
[16:47:06] We don't have enough room to just create new nodes, I think.
[16:47:13] (That'd also work)
[16:47:35] Coren: if we just add the /etc/hosts entry, won't all the current nodes report the same hostname -f as they used to be, and just go on as usual?
[16:48:34] They'll report with the same fqdn, sure, so it'll wallpaper over the current issue. But it's brittle and will need to be changed.
[16:48:55] Coren: sure, but we won't have a continuing outage?
[16:49:13] I know, that's why I'm working on generating it now.
[16:49:28] wouldn't https://gerrit.wikimedia.org/r/#/c/215355/ be all that's needed?
[16:50:07] or probably not.
[16:50:08] * YuviPanda test
[16:50:08] s
[16:50:44] Heh. No; that would not help at all.
[16:51:17] The exec node needs to know its fake fqdn for its IP, and the masters need to agree on that name for that IP as well.
[16:52:01] Coren, YuviPanda, sorry this is such a mess :(
[16:52:17] andrewbogott: Had to be done eventually.
[16:52:29] hi. after the upcoming puppetmaster/dns changes, will labs puppet client certs be based on the fqdn, or still the ec2id?
[16:52:44] Coren: I thought gridengine doesn't use rdns?
[16:53:14] YuviPanda: So I thought; the clients don't but the master checks since auth is name-based (makes sense, actually)
[16:53:28] oh, sigh. right..
[16:53:55] YuviPanda: ~marc/hh contains what should be in the hosts file
[16:54:21] 6Labs, 10Beta-Cluster, 10Continuous-Integration-Infrastructure, 10Staging: Change local puppetmaster names for RE labs projects - https://phabricator.wikimedia.org/T101110#1330304 (10Andrew) Yes, it works. From puppet: # else assume short hostname and append domain. default => "${...
[16:54:33] Coren: ah, and this should be on the master / shadow hosts files?
[16:54:55] YuviPanda: Yes, so that it'll match that the exec nodes currently self-report
[16:55:12] Coren: alright.
[16:55:19] * YuviPanda stands by, doesn't fiddle with anything
[16:55:34] I guess we should puppetize that.
[16:55:38] * YuviPanda is unsure how exactly to
[16:55:52] The only alternative I see is to really just restart everything on the "new" names.
[16:56:12] But even then, that's going to cause reporting issues because the old nodes are unable to report in atm.
[16:56:24] Coren: we can restart all the restartable jobs and then let the running tasks just continue, I guess?
[16:56:53] Sure, but if we do that we can't use the new name for the exec node until it's completely drained.
[16:57:06] * Coren ponders.
[16:57:27] 6Labs, 10Beta-Cluster: Make sure labs ENC knows about cert changes - https://phabricator.wikimedia.org/T101124#1330344 (10Andrew) Bitrotted patch: https://gerrit.wikimedia.org/r/#/c/202790/
[16:57:32] First things first, lemme band-aid the hosts files so that the grid is back up.
[16:57:36] ok
[17:01:25] I'm still not getting why resolving atm randomly gets the old or new name.
[17:01:51] how are you testing it?
[17:02:08] YuviPanda: I'm not, I'm just speaking to what gridengine reports.
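A hedged guess at the shape of the hosts-file band-aid discussed above (the ~marc/hh file itself isn't shown in the log): entries on tools-master/tools-shadow pinning each node's IP back to the old-style name the exec nodes still self-report, so the master's reverse lookups keep matching. The IPs below are placeholders.

```
# /etc/hosts fragment (illustrative only; real IPs and the full host list differ)
10.68.17.11   tools-exec-1201.eqiad.wmflabs   tools-exec-1201
10.68.16.35   tools-submit.eqiad.wmflabs      tools-submit
```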
[17:02:21] ah, that was same as last week's outage then :)
[17:03:35] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs in /public/dumps - https://phabricator.wikimedia.org/T100885#1330403 (10Tobi_WMDE_SW)
[17:06:07] YuviPanda: Oooooh. There may be a bit of trickery we can use that's actually builtin to gridengine that I didn't know about.
[17:06:42] go on?
[17:06:50] Coren: did the hosts file hack fix gridengine for now?
[17:06:56] (seems to have)
[17:06:57] YuviPanda: Only partly.
[17:07:29] It's still randomly unreliable. But I think I found a better workaround.
[17:07:38] 10Tool-Labs, 3Labs-Sprint-100: Clean up huge logs on toollabs - https://phabricator.wikimedia.org/T98652#1330427 (10yuvipanda) cluebot's error log grew to 1.5T again, deleted and notifying maintainer (https://en.wikipedia.org/wiki/User_talk:Rich_Smith#Cluebot.27s_error_logs)
[17:07:42] * Coren reads docs, will emerge shortly.
[17:08:21] aha!
[17:10:09] chasemp: hey! no, the reason prod can't access labs is because there's a big firewall there explicitly forbidding all access
[17:10:31] we have punched some very specific holes (graphite.wmflabs.org and labsdb) but outside of that they're segregated
[17:10:41] chasemp: nothing to do with the NAT stuff
[17:10:47] right and that seems like a bigger question than this task is asking
[17:10:50] ok
[17:12:34] chasemp: yeah. lots of things gotta change to do that, I guess.
[17:12:53] YuviPanda: http://gridscheduler.sourceforge.net/htmlman/htmlman5/host_aliases.html
[17:13:14] hahahaa
[17:13:15] wow
[17:13:17] nice :)
[17:13:23] so that should fix the problem, I hope
[17:14:04] I've added an alias file with every exec and submit host.
[17:14:22] Yeay strace while trying to debug the issue.
[17:14:46] Which allowed me to see it trying to open that file and go "hmmm. 'host_aliases'. whu?"
[17:15:27] So if we see more issues, checking that file has the alias should be first step
[17:18:22] Ack. almost.
[17:21:53] YuviPanda: Still one issue with a partially munged queue (catscan). About to fix then we should be all set.
[17:22:53] Coren: ok. we should still figure out a long term solution, though. that might just be rebuilding instances slowly, *again*. (maybe this time we can drop the tools- prefix!)
[17:23:03] Coren: should also puppetize the aliases file, I think.
[17:23:54] YuviPanda: Probably. I'll wait until things are fully back up and we have a real plan.
[17:25:16] Coren: ok. should also file an incident report.
[17:38:28] Coren: back to being dead, btw
[17:38:36] 6Labs, 10Labs-Infrastructure: Labs webservice2 not working - https://phabricator.wikimedia.org/T101132#1330616 (10Magnus) 3NEW a:3yuvipanda
[17:39:06] YuviPanda: I am aware, trying to fix the damn catscan queue that was only partially converted. ima delete it entirely.
[17:39:17] Coren: ok
[17:39:29] 6Labs, 10Labs-Infrastructure: Labs webservice2 not working - https://phabricator.wikimedia.org/T101132#1330626 (10yuvipanda) Looks like there's a gridengine master outage again. We are looking into it.
[17:44:07] ... oh no.
[17:45:40] YuviPanda: You know that great solution that just fixed everything?
[17:46:32] YuviPanda: That great solution just ended up effing everything up. Badly.
[17:46:57] heh.
[17:47:01] what did it do now
[17:47:12] YuviPanda: So, earlier, I created a bunch of exec nodes so that we had the new names ready to switch.
[17:47:20] YuviPanda: So we had every instance twice.
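For reference, the host_aliases file Coren found above (man page linked at 17:12:53) is a plain text map, one host per line, with the primary name first and its aliases after it. The path and the entries below are assumptions sketched from that man page, not a copy of the real file on tools-master.

```
# /var/lib/gridengine/default/common/host_aliases (path assumed)
# primary_name                    alias(es)
tools-exec-1201.eqiad.wmflabs     tools-exec-1201.tools.eqiad.wmflabs
tools-submit.eqiad.wmflabs        tools-submit.tools.eqiad.wmflabs
```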
[17:47:27] by 'created a bunch of exec nodes' you mean in gridengine config, right?
[17:47:31] YuviPanda: (once under each name)
[17:47:34] Yes, in ge config
[17:47:51] YuviPanda: So now, with the alias file in place, when I deleted the bunch of new instances?
[17:48:08] YuviPanda: It has, in fact, deleted both. Because it thought them equivalent.
[17:48:27] * Coren headdesk.
[17:48:32] nice.
[17:48:47] so it thinks there are no exec nodes now?
[17:48:52] Well, in its defence, they _were_. explicitly.
[17:49:17] master is down now, I can't seem to do anything
[17:49:22] YuviPanda: In such a way that it is not even possible to actually start the master because the queues are broken.
[17:49:26] ah
[17:49:27] nice
[17:49:36] so
[17:49:39] what do we do now?
[17:49:48] I guess all running tasks, etc lost state?
[17:50:17] YuviPanda: Actually, possibly not. I'm looking at how to rebuild this now.
[17:50:33] YuviPanda: the exec nodes are still running and keeping state until the master is back up.
[17:50:53] ok
[17:51:17] That "simple" dns change ended up being a bit more troublesome than expected. :-/
[17:53:58] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1330711 (10yuvipanda) @Magnus do you mind if I just delete it and recreate it?
[18:01:04] Coren: I emailed labs-announce
[18:01:16] YuviPanda: ty. I'm busy pulling my hair out atm.
[18:02:05] alright. I'm staying away from it atm so we don't have two people poking at it at the same time :)
[18:03:52] YuviPanda: Yeah dbd database recovery by timestamp. :-)
[18:03:57] bdb*
[18:05:27] I still need to make sure everything is actually working right, but things seem back up. :-)
[18:08:57] YuviPanda: fwiw the gridengine host alias thing seems to have solved every problem in one fell swoop, if you ignore that temporary "omg I broke the world" bit. :-)
[18:12:51] :)
[18:13:18] Coren: when you think things have stabilized, can you 1. puppetize the host aliases, 2. write an incident report?
[18:13:32] is it possible to use htaccess files on tool labs?
[18:13:49] YuviPanda: Only fallout is that gridengine probably lost track of a couple of jobs that have started or ended in the troubled window (approx 30m) but it's unlikely that any job were started during.
[18:14:12] kaldari: No, that's an apache-specific thing. But most of what you might have wanted to do with it can be done with lighttpd (just differently)
[18:15:09] Coren: Thanks. Is there anything specific I need to know about setting it up, or should I just look up standard instructions for using lighttpd?
[18:15:13] YuviPanda: Yes to both, though I expect a couple hours at least. I'm still seeing a few wonky exec nodes and a couple odd things in the logs I want to look into.
[18:15:24] Coren: cool
[18:16:17] kaldari: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web has recipes for the most common things; but yeah, you can also just look up normal lighttpd config doc
[18:16:40] Look at 'Example configurations'
[18:28:12] YuviPanda: Do you know what's up with -webgrid-lighttpd-1406?
[18:28:25] hmm? is it doing anything?
[18:28:57] No, that being the exact issue. :-)
[18:30:34] YuviPanda: If you're not working on it, nor know of any issues, it probably just needs a reboot.
[18:30:56] Coren: alright. feel free to :)
[18:34:49] !log tools rebooting tools-webgrid-lighttpd-1406.eqiad.wmflabs
[18:34:52] Logged the message, Master
[18:37:21] I've created a new instance for my project (wikidata-query) and it rejects my public key. Old ones work fine.
Anybody knows what could be the problem?
[18:50:33] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1331014 (10Magnus) Go ahead please!
[19:18:12] labs-recursor0.wikimedia.org
[19:18:19] Recursive DNS - CRIT
[19:18:28] maybe known. dunno
[19:18:34] just see it in Icinga
[19:21:10] andrewbogott, ^
[19:21:47] I don’t know what that is, but I will look
[19:32:13] is there a generic phabricator bug opened for missing record in replicas or must I open a new one?
[19:44:13] phe: "Missing records"?
[19:48:15] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs in /public/dumps - https://phabricator.wikimedia.org/T100885#1331309 (10JanZerebecki) Is there some automated script in some git repo that puts either these dumps there? F...
[19:48:21] is there a role for labs I could use to install varnish?
[20:12:25] 6Labs, 10Beta-Cluster, 10Continuous-Integration-Infrastructure, 10Staging: Change local puppetmaster names for RE labs projects - https://phabricator.wikimedia.org/T101110#1331464 (10thcipriani) Staging, Deployment-prep, and Integration projects are all updated. No-op in all cases. $ RUBYLIB=/var/lib...
[20:12:48] 6Labs, 10Beta-Cluster, 10Continuous-Integration-Infrastructure, 10Staging: Change local puppetmaster names for RE labs projects - https://phabricator.wikimedia.org/T101110#1331465 (10thcipriani) 5Open>3Resolved
[20:13:50] PROBLEM - Puppet failure on tools-webgrid-generic-1403 is CRITICAL 60.00% of data above the critical threshold [0.0]
[20:17:25] Coren, missing row in tables, or perhaps index broken
[20:18:57] phe: Do you have a specific example in mind? I can look into it quickly and either confirm there is an issue or give you an explanation where there isn't. :-)
[20:19:13] commonswiki_p
[20:19:22] SELECT img_name, img_sha1 FROM image WHERE img_name like 'Rabier_-_F%';
[20:20:05] this File: is missing: https://commons.wikimedia.org/wiki/File:Rabier_-_Cr%C3%A9tinot,_1922.djvu
[20:20:54] anyway that's not the first time that sort of things occur, replicas on labs has never been reliable
[20:21:05] ... your query excludes it. :-) You're looking for 'Rabier_-_F%' and this is 'Rabier - C' not F. :-)
[20:21:05] this would never match?
[20:21:37] how right :D
[20:21:53] like 'Rabier_-_C%' returns it as expected. :-)
[20:38:51] RECOVERY - Puppet failure on tools-webgrid-generic-1403 is OK Less than 1.00% above the threshold [0.0]
[21:42:02] !log logstash Deleted all instances; will rebuild project after labs dns switch ~2015-06-11
[21:42:07] Logged the message, Master
[21:52:02] andrewbogott: I have an instance named shaved-yak.eqiad.wmflabs in the mediawiki-core-team project. `host shaved-yak.mediawiki-core-team.eqiad.wmflabs` is returning NXDOMAIN. This seems to be a blocker to me skipping ahead to step 4 in the dns migration process
[21:52:48] bd808: ok, let me find my email so I can see what step 4 is
[21:52:56] "On all puppet clients, edit /etc/puppet/puppet.conf and change the puppetmaster name by inserting the project name before .eqiad.wmflabs."
[21:52:58] I love how when you send an email to a mailing list gmail hides it forever
[21:53:10] yeah. gmail is fun like that
[21:53:32] mind if I log in and poke around?
[21:53:38] feel free
[21:53:54] my goofy test server is at your disposal :)
[21:54:16] your resolv.conf shows you as still using dnsmasq
[21:54:40] what project is this in?
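The step-4 instruction bd808 quotes above (insert the project name before .eqiad.wmflabs in /etc/puppet/puppet.conf) amounts to something like the following on each client; the instance and project names are taken from this example, and the sed pattern is a sketch rather than the documented procedure.

```
# Assuming puppet.conf currently has e.g. "server = shaved-yak.eqiad.wmflabs"
# in [agent]/[main]; rewrite it to the project-scoped form and re-run puppet.
sudo sed -i 's/^\(\s*server\s*=\s*shaved-yak\)\.eqiad\.wmflabs/\1.mediawiki-core-team.eqiad.wmflabs/' \
    /etc/puppet/puppet.conf
sudo puppet agent --test
```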
[21:54:40] when I try to run puppet I get "Error 400 on SERVER: stack level too deep"
[21:54:47] ah, ok
[21:54:59] if puppet can’t run then… nothing else will work. So let’s figure out about puppet :)
[21:55:08] The only change I made was removing use_dnsmasq=true
[21:55:14] it was running before that :(
[21:55:35] At least I think that's the only change
[21:55:48] did you update /var/lib/git/operations/puppet?
[21:56:16] yeah. the top upstream commit is 2 hours old
[21:56:28] 19bf816
[21:56:48] private too?
[21:56:53] Not sure if it matters, just being thorough
[21:56:57] probably not, no
[21:57:12] ok, I will...
[21:57:20] just did it
[21:57:40] caught some password change
[22:00:10] I restarted the puppetmaster and that seems to have fixed things.
[22:00:17] * andrewbogott updates the docs
[22:01:01] sweet. thanks
[22:02:10] can you take it from here? I’m curious to hear which other parts of my checklist are wrong
[22:02:22] yeah. I'll keep going
[22:02:40] I suppose I'll need to upde my instance proxies too right?
[22:02:45] *update
[22:03:03] it all depends on how they’re puppetized.
[22:03:15] I mean https://wikitech.wikimedia.org/wiki/Special:NovaProxy settings
[22:03:23] Oh, no, that shouldn't matter.
[22:03:36] The old fqdn will still resolve.
[22:03:47] excellent
[22:03:57] It’s just that $domain is different, so anything in puppet that refers to that gets confused.
[22:35:32] 6Labs, 10Labs-Infrastructure: Fix syslog error "nslcd[29117]: error writing to client: Broken pipe" - https://phabricator.wikimedia.org/T78616#1332156 (10Gage) There's also a Debian bug report discussing this: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=685504 They propose creating a dummy user in /etc/...
[22:36:16] bd808: any luck?
[22:36:27] It seems to all be ok
[22:36:39] really?
[22:36:56] that was a degenerate test really. it was single node host that happened to use a fqdn to reference itself
[22:37:08] Ah, ok. Well, one more edge case survived :)
[22:37:10] Thanks.
[22:37:39] deployment-prep will be the hardest migration I think
[22:40:05] yeah, I’ll probably schedule time with Hashar to do that.
[22:40:38] I think thcipriani would be the guy to talk to
[22:40:57] I think hashar has gladly forgotten how to manage the cluster ;)
[22:41:27] 6Labs, 6WMF-Legal: Request to review privacy policy and rules - https://phabricator.wikimedia.org/T97844#1332170 (10d33tah) >>! In T97844#1273445, @ZhouZ wrote: > Hi @d33tah, thank you for your contributions on WikiLabs. The WMF legal team has reviewed your issue and has some thoughts in regards to your quest...
[22:43:30] andrewbogott: I'd be down to do deployment-prep
[22:43:50] I don't think it'll be too too bad :)
[22:43:52] thcipriani: ok, that might make more sense since we’re in the same timezone.
[22:44:06] We can do it any day this week, just ping me when you have a few hours.
[22:44:12] But not right now ‘cause I’m about to leave
[22:44:28] heh, sure. I'll get with you tomorrow to setup some time.
[22:45:05] great!
[23:25:00] is there a good way to get commons image urls that doesn't involve scraping file pages?
[23:25:50] Like, using some combination of the API, the file name, and the page_id?
[23:30:28] aha!
[23:30:37] iiprop=url
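To spell out the iiprop=url answer at the end: the imageinfo API returns the direct file URL for a given title (or pageids=...), so no page scraping is needed. The example title below is the one from the earlier replica discussion; any File: title works.

```
# Fetch the direct URL for a Commons file via the API; the URL lands at
# .query.pages.<pageid>.imageinfo[0].url in the JSON response.
curl -s 'https://commons.wikimedia.org/w/api.php?action=query&format=json&prop=imageinfo&iiprop=url&titles=File:Rabier_-_Cr%C3%A9tinot,_1922.djvu'
```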