[00:08:45] yuvipanda: qlogin/qsh requires hba [00:08:55] Though I think that's a little-used feature. [00:09:09] (Still part of the gridengine suite, however) [06:30:52] PROBLEM - Puppet failure on tools-exec-1405 is CRITICAL 33.33% of data above the critical threshold [0.0] [06:55:53] RECOVERY - Puppet failure on tools-exec-1405 is OK Less than 1.00% above the threshold [0.0] [07:05:59] 10Tool-Labs, 5Patch-For-Review: setup host-based auth for tools hosts - https://phabricator.wikimedia.org/T98714#1275492 (10valhallasw) >>! In T98714#1275365, @yuvipanda wrote: > Come to think of it, why do we allow HBA to exec nodes for users? I've used it for debugging and logging (tail scriptname.log -- NF... [07:06:50] yuvipanda: it works:) https://tools.wmflabs.org/blogconverter/ - thanks again [07:11:21] (03CR) 10Merlijn van Deen: "The main issue with" [labs/toollabs] - 10https://gerrit.wikimedia.org/r/209968 (https://phabricator.wikimedia.org/T98641) (owner: 10Merlijn van Deen) [07:27:20] ~. [08:35:33] (03CR) 10Filippo Giunchedi: "how many packages/repos are we talking about?" [labs/toollabs] - 10https://gerrit.wikimedia.org/r/209968 (https://phabricator.wikimedia.org/T98641) (owner: 10Merlijn van Deen) [08:38:16] PROBLEM - Puppet staleness on tools-shadow is CRITICAL 100.00% of data above the critical threshold [43200.0] [08:45:59] (03CR) 10Merlijn van Deen: "I count 19 python packages in /data/project/.system/deb-*. These are all magic .debs without any source/build script at all. These two (rd" [labs/toollabs] - 10https://gerrit.wikimedia.org/r/209968 (https://phabricator.wikimedia.org/T98641) (owner: 10Merlijn van Deen) [09:40:44] 6Labs, 10Beta-Cluster, 5Patch-For-Review: Move logs off NFS on beta - https://phabricator.wikimedia.org/T98289#1275656 (10hashar) >>! In T98289#1274035, @bd808 wrote: > I might as well take this at this point. :) Thanks for stepping in! I was not sure who might know about our logging processing and would b... 
[09:44:39] 6Labs, 10Beta-Cluster, 10Labs-Infrastructure, 6operations: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1275659 (10hashar) [09:50:48] あS34キオGH後;7UUYT6EYTRDGHJDEUYH えっRGRFHRD [09:51:05] SDT6RF4QR5たDGSGFっtrfy89どkdghっっっgfっgf」 [09:52:49] I am sorry that my 2 years old kids input something like above just now [10:20:29] PROBLEM - Puppet staleness on tools-exec-catscan is CRITICAL 100.00% of data above the critical threshold [43200.0] [10:24:14] PROBLEM - Puppet staleness on tools-exec-cyberbot is CRITICAL 100.00% of data above the critical threshold [43200.0] [10:31:52] PROBLEM - Puppet staleness on tools-exec-15 is CRITICAL 100.00% of data above the critical threshold [43200.0] [10:32:14] PROBLEM - Puppet staleness on tools-exec-gift is CRITICAL 100.00% of data above the critical threshold [43200.0] [10:33:10] PROBLEM - Puppet staleness on tools-exec-wmt is CRITICAL 100.00% of data above the critical threshold [43200.0] [10:36:09] PROBLEM - Puppet staleness on tools-exec-08 is CRITICAL 100.00% of data above the critical threshold [43200.0] [10:36:49] PROBLEM - Puppet staleness on tools-exec-14 is CRITICAL 100.00% of data above the critical threshold [43200.0] [10:37:03] PROBLEM - Puppet staleness on tools-exec-07 is CRITICAL 100.00% of data above the critical threshold [43200.0] [10:46:17] hello the instance maps-warper appears to be in SHUTOFF state is there anything I can do? [10:48:47] 6Labs: allow routing between labs instances and public labs ips - https://phabricator.wikimedia.org/T96924#1275772 (10hashar) Thread via Gmane is at http://thread.gmane.org/gmane.comp.cloud.openstack.general/9946 [11:04:13] I'll create an issue ;) [11:13:28] 6Labs, 10Maps: Activate maps-warper labs instance due to SHUTOFF state. 
- https://phabricator.wikimedia.org/T98730#1275779 (10Chippyy) 3NEW [11:25:10] I want to create new project in wikimeida-labs [11:33:40] I already registered at wikitech [11:45:16] konggaru: create a subtask under https://phabricator.wikimedia.org/T76375 [11:51:54] 6Labs, 7Tracking: create "Furutani-bot" project - https://phabricator.wikimedia.org/T98732#1275836 (10konggaru) 3NEW [11:55:56] 6Labs, 7Tracking: create "Furutani-bot" project (drop) - https://phabricator.wikimedia.org/T98732#1275844 (10konggaru) 5Open>3declined [11:55:57] 6Labs, 7Tracking: New Labs project requests (Tracking) - https://phabricator.wikimedia.org/T76375#1275848 (10konggaru) [11:58:17] I closed that subtask because I need only toollabs account... [11:59:12] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/콩가루 was created, changed by 콩가루 link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/%ec%bd%a9%ea%b0%80%eb%a3%a8 edit summary: Created page with "{{Tools Access Request |Justification=I will run a bot that alert talk page abuse on kowiki(like editing other one's talk, deleting other one's talk, or using other one's sign..." [12:00:36] Better so then :) [12:11:02] 6Labs: allow routing between labs instances and public labs ips - https://phabricator.wikimedia.org/T96924#1275862 (10akosiaris) Antonio Messina on that thread is also giving the exact same explanation for the problem. However he is also providing a solution for the "SNAT to what and when" question I posed befor... 
[12:22:51] 6Labs, 10Labs-Infrastructure: Instance in SHUTOFF state, not rebooting - https://phabricator.wikimedia.org/T98602#1275876 (10jkroll) [12:31:12] 6Labs, 10Labs-Infrastructure: Instance in SHUTOFF state, not rebooting - https://phabricator.wikimedia.org/T98602#1275888 (10jkroll) [12:32:03] 6Labs, 10Labs-Infrastructure: Instance in SHUTOFF state, not rebooting - https://phabricator.wikimedia.org/T98602#1272222 (10jkroll) [12:32:26] 6Labs, 10Labs-Infrastructure: Instance in SHUTOFF state, not rebooting - https://phabricator.wikimedia.org/T98602#1275893 (10Andrew) This is almost certainly my fault, I'm trying to fix right now. [12:34:12] 6Labs, 10Labs-Infrastructure: Instance in SHUTOFF state, not rebooting - https://phabricator.wikimedia.org/T98602#1275903 (10Andrew) ok... is that better? The (boring) explanation is that this instance is quite old and requires a backing image that I didn't realize was still in use. I migrated it to the new... [12:35:26] 6Labs: allow routing between labs instances and public labs ips - https://phabricator.wikimedia.org/T96924#1275904 (10akosiaris) Managed to get something working. Specifically I added ``` iptables -t nat -I POSTROUTING 1 -j SNAT -s 10.68.16.68 -m conntrack --ctstate DNAT --to-source 208.80.155.155 ``` which al... [12:37:04] hmm, maybe my one may be similar [12:39:00] 6Labs, 10Maps: Activate maps-warper labs instance due to SHUTOFF state. - https://phabricator.wikimedia.org/T98730#1275906 (10Chippyy) could be related to https://phabricator.wikimedia.org/T98602 ? [12:39:16] chippy: looking! [12:39:23] thanks andrewbogott [12:41:38] chippy: I thought I went through and fixed all of those instances but I must’ve missed a few :( [12:41:56] andrewbogott, that's cool :) It might be related to it being low in disk space [12:42:34] nah, I think it’s just that the instance pre-dates the disk images that I copied [12:42:44] ahh I see [12:43:00] Mornin' Andrew. NFS seems to have been well behaved this weekend? 
[12:43:08] It should be booting now [12:43:23] Coren: yeah, things have been pretty solid apart from the standard flood of puppet alerts. [12:43:45] thanks! [12:44:24] 6Labs, 10Maps: Activate maps-warper labs instance due to SHUTOFF state. - https://phabricator.wikimedia.org/T98730#1275912 (10Andrew) 5Open>3Resolved a:3Andrew [12:44:51] andrewbogott: Yeah, I'm pretty sure those aren't actually related to storage - but I'll be [bleep]ed if I can figure them out. [12:45:27] Coren: I was going to try to increase the puppet interval just to see if it’s related to load on virt1000. Haven’t gotten around to it, though, forking that code for labs will be slightly messy [12:45:31] thanks andrewbogott :) [12:46:24] * Coren goes to fetch cafeine [14:16:50] 6Labs: Upgrade Labs Compute nodes to Trusty - https://phabricator.wikimedia.org/T90822#1276102 (10Andrew) 5Open>3Resolved [14:16:52] 6Labs: Upgrade labs cluster to Trusty - https://phabricator.wikimedia.org/T90821#1276103 (10Andrew) [14:19:39] 6Labs, 10hardware-requests, 6operations: labnet1002 - https://phabricator.wikimedia.org/T98740#1276111 (10Andrew) 3NEW [14:20:09] 6Labs, 10Labs-Infrastructure: Instance in SHUTOFF state, not rebooting - https://phabricator.wikimedia.org/T98602#1276119 (10jkroll) Great, instance is up and running now. Thanks a lot Andrew! [14:20:11] 6Labs, 10hardware-requests, 6operations: labnet1002 - https://phabricator.wikimedia.org/T98740#1276121 (10Andrew) [14:20:13] 6Labs: Upgrade labs network node to trusty - https://phabricator.wikimedia.org/T90823#1276120 (10Andrew) [14:21:15] 6Labs, 10Labs-Infrastructure: Instance in SHUTOFF state, not rebooting - https://phabricator.wikimedia.org/T98602#1276122 (10Andrew) 5Open>3Resolved a:3Andrew [14:21:38] 6Labs, 10Labs-Infrastructure: Increase RAID6 sync_speed_min to a sensible level - https://phabricator.wikimedia.org/T98456#1276127 (10coren) 5Open>3Resolved Resync has completed late last week. 
[15:10:25] 6Labs, 10Labs-Infrastructure, 6operations: Migrate Labs NFS storage from RAID6 to RAID10 - https://phabricator.wikimedia.org/T96063#1276242 (10coren) The plan is to gradually evacuate the pv on individual raid arrays with pvmove, reconfigure the freed raid arrays with raid 10, and recreate pv on the new arra... [15:24:10] 10Quarry: Seeing all of created queries in one page - https://phabricator.wikimedia.org/T98688#1276295 (10Capt_Swing) See also: # https://phabricator.wikimedia.org/T77948 # https://phabricator.wikimedia.org/T88920 [15:47:33] 6Labs, 10Labs-Infrastructure, 6operations: Migrate Labs NFS storage from RAID6 to RAID10 - https://phabricator.wikimedia.org/T96063#1276362 (10coren) An alternative plan, based on input from @mark, that front loads the thin pool move to give performance improvement earlier. With a bit of extra juggling (bec... [16:17:47] 10Tool-Labs, 10Wikidata: Add wb_changes_subscription and wbc_entity_usage to labs db replication - https://phabricator.wikimedia.org/T98748#1276437 (10aude) 3NEW [16:19:42] 10Tool-Labs, 10Wikidata, 5Patch-For-Review, 3Wikidata-Sprint-2015-05-05: Add wb_changes_subscription and wbc_entity_usage to labs db replication - https://phabricator.wikimedia.org/T98748#1276453 (10aude) [16:25:43] andrewbogott: yuvipanda: As a reminder (and just to confuse yuvi's tz clock further) I'm travelling tomorrow then will be working with UTC+2 schedule until the Hackathon. [16:26:20] Coren: there’s no chance that I’ll remember that, but… thanks for the warning :) [16:29:05] bblack: Gentle reminder re https://gerrit.wikimedia.org/r/#/c/209558/ :-) [16:30:07] bblack: Ignore me; I hadn't noticed the unrated comments. :-) [16:39:06] Coren: re that patch... [16:39:48] I don't think it will ever make sense to have several functional modules on a host applying unrelated sets of tc rules to one shared interface. tc is awful enough as it is, and that would be crazy. 
[16:40:27] I see this as something where we probably want one tc::ruleset { 'bond0': [ .... ] } set at the machine-role level, blocking any other on the same interface [16:40:53] or whatever you want to name/define it as [16:40:58] Well, I can easily contrive of scenarios where it would not be insane to do so - but I see your point just as well. Simplicity wins by default. [16:41:40] yeah you could do that in theory, but we'd need some infrastructure in puppet with a deeper understanding of how rules hook into each other to make that sane. [16:42:05] sounds like something to leave for whoever thinks they really need that down the road, if ever. tc will probably only see limited use globally anyways... [16:42:18] Yeah, I agree. [16:42:25] * Coren will revise accordingly. [16:56:31] "Coren is now known as Coreurope" [16:56:54] Good luck with the jetlag :) [17:23:26] valhallasw: I suffer no letlag in that direction; it's the return that I'll play. [17:23:29] pay* [17:39:05] Coren: what do you think we should do with out .debs? Would it be useful to move to a prod-like git+build system? [17:39:08] with our .debs* [17:39:50] Consistency always scores points; but I'm not sure I've quite decided which was the most sensical in our context. 
[17:40:10] *nod* See https://gerrit.wikimedia.org/r/#/c/209968/ for some more context [17:41:06] it's not entirely clear to me how the entire prod build system works though, or what parts could actually be used by non-WMF staff [17:43:11] (and which parts are automated and which aren't) [17:46:57] (03PS1) 10Andrew Bogott: Added dummy entries for ceilometer [labs/private] - 10https://gerrit.wikimedia.org/r/210102 [17:48:23] yuvipanda, valhallasw : re: "main advantage is that we can move it into thesame repo as the puppet classes, so doc changes and infrachanges are together" [17:51:35] For those people that didn't notice I updated the epic image :) https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools#/media/File:Labs_Tools_topology.png ;) [17:51:58] puppet stuff is alreadypublished to https://doc.wikimedia.org/ ... somehow. I'm all in favor of doc living with code (T48526), so code changes can be -1d until the doc is updated. But we really need search that spans doc.wikimedia.org and wikitech.wm.o, otherwise people will stick to one or the other (T87802). Meanwhile you can put soft redirects on wikitech to doc.wikimedia.org. [17:59:22] (03CR) 10Andrew Bogott: [C: 032] Added dummy entries for ceilometer [labs/private] - 10https://gerrit.wikimedia.org/r/210102 (owner: 10Andrew Bogott) [17:59:37] (03CR) 10Andrew Bogott: [V: 032] Added dummy entries for ceilometer [labs/private] - 10https://gerrit.wikimedia.org/r/210102 (owner: 10Andrew Bogott) [18:03:07] yuvipanda: hello! do you have a minute for a talk? [18:07:39] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/콩가루 was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=158392 edit summary: [18:11:32] ImportError: No module named datetime [18:11:35] * legoktm scratches head [18:12:04] is there a precise bastion somewhere [18:12:05] ? [18:12:13] legoktm: no, but we have precise-dev [18:12:34] tools-precise-dev? 
[18:12:48] yeah [18:13:09] legoktm@tools-precise-dev:~$ become legobot [18:13:09] become: command not found [18:13:27] .... wtf [18:14:32] !log toolsbeta building toolsbeta-pbuilder to experiment with pbuilder for building packages [18:14:38] Logged the message, Master [18:14:47] okay, sudo'd in manually for now [18:14:56] The last Puppet run was at Sun May 10 22:58:41 UTC 2015 (1156 minutes ago). [18:14:58] *frown* [18:15:18] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Package[pyflakes] is already declared in file /etc/puppet/modules/toollabs/manifests/exec_environ.pp:392; cannot redeclare at /etc/puppet/modules/toollabs/manifests/dev_environ.pp:98 on node i-00000c3e.eqiad.wmflabs [18:15:20] orite [18:15:30] how did we miss that [18:15:50] valhallasw: I don't think it's using the shared crontab either [18:16:01] legoktm: no, puppet manifest broken == everything broken [18:16:13] ok [18:17:30] puppet being stupid again [18:19:49] oh, great, icinga is dead [18:20:24] Labs, Icinga: https://icinga.wmflabs.org/ down - https://phabricator.wikimedia.org/T98765#1276963 (valhallasw) NEW [18:20:44] oh, wait, but I need shinken [18:21:23] which also didn't notice puppet not working? wth. [18:26:43] yuvipanda: ^ any clue how that's possible? [18:26:45] (PS1) Andrew Bogott: Renamed passwords::ceilometer to passwords::openstack::ceilometer [labs/private] - https://gerrit.wikimedia.org/r/210109 [18:26:46] yuvipanda: oh, I know. We didn't have any host that included dev_environ w/ precise before, and tools-precise-dev never registered in graphite/shinken [18:28:46] I had to look up what ceilometer was when I saw that email :p [18:29:09] bah, toolsbeta has reached quota. Who should I poke for that again? [18:35:04] valhallasw: me, although… do you really need more or can you free some unused instances? 
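Editorial sketch, not part of the original log: the Puppet failure above came from Package[pyflakes] being declared in both exec_environ.pp and dev_environ.pp, which only surfaced once a single host included both classes. A rough way to spot such overlaps before a catalog fails is to scan manifest text for package titles that appear in more than one file. The regular expression below is an assumption that only handles the simplest `package { 'name':` form (no array titles, no conditionals), and the manifest contents in `example` are made up for illustration.

```python
import re
from collections import defaultdict

# Simplistic pattern (assumption): matches only  package { 'name':  with a single quoted title.
PACKAGE_RE = re.compile(r"package\s*\{\s*['\"]([^'\"]+)['\"]\s*:")


def find_duplicate_packages(manifests):
    """Return {package name: [manifests]} for packages declared in more than one manifest."""
    seen = defaultdict(list)
    for path, text in manifests.items():
        # A set avoids counting the same file twice for one package.
        for name in sorted(set(PACKAGE_RE.findall(text))):
            seen[name].append(path)
    return {name: paths for name, paths in seen.items() if len(paths) > 1}


# Hypothetical manifest contents; only the file names and 'pyflakes' come from the log.
example = {
    "exec_environ.pp": "package { 'pyflakes': ensure => latest }\n"
                       "package { 'python-flask': ensure => latest }",
    "dev_environ.pp": "package { 'pyflakes': ensure => latest }",
}

print(find_duplicate_packages(example))
```

A hit from a scan like this is only a hint: Puppet raises the error only when both declarations end up in the same node's catalog, which is why the problem stayed hidden until tools-precise-dev included both classes.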
[18:35:18] andrewbogott: I'm not sure; most of them were created by scfc [18:35:32] and most of it is replicating tools, so having 2 exec hosts makes sense [18:36:24] 7 hosts to replicate all of tools, 1 puppetmaster, 2 for testing new configs (jessie / logstash) [18:36:36] I can kill the logstash one, I think [18:38:33] valhallasw: I raised it by two instances, will that get you unstuck for now? [18:38:42] andrewbogott: yes, thanks [18:39:25] valhallasw: I’m going to lunch; ping me in ~45 if you find yourself stuck again. [18:39:42] andrewbogott_afk: thanks; I just needed one to play around with pbuilder, so this should be enough! [18:45:47] Coren, could you comment on the tools-mailrelay behavior in https://phabricator.wikimedia.org/T97574#1277010 ? Is that intended? [18:46:00] * Coren checks. [18:47:43] valhallasw: You mean mail to valhallasw@tools.wmflabs.org? [18:48:03] no, to @arctus.nl [18:48:08] the last one [18:48:40] Wait, you're testing for not-delivering? Oh, open relay check. Yes, the bastions are "inside" [18:49:44] ah, okay. Any suggestion how to test for open relay otherwise? Should I just assign an external IP? Or are there hosts in labs that do not count as inside? [18:50:17] Well, you'll need to have a public IP anyways to have a mail relay at all in the first place, so you might as well do it. [18:50:27] And no, anything in 10/ is inside. [18:50:47] well, the idea was it'll take tools-mail's place [18:51:06] but adding an extra external IP and handling it via MX records is sensible as well, I guess [18:53:45] Coren: as for valhallasw@tools... that's because that's not supposed to work, right? I should use a tool account for that :D [18:59:35] 10Tool-Labs: Provision and test tools-mailrelay-01 - https://phabricator.wikimedia.org/T97574#1277075 (10Legoktm) [19:06:13] valhallasw: Right. [19:06:29] valhallasw: Wait, no, that /should/ work also, but send to your wikitech confirmed address. 
[19:06:44] Hum, gridengine is failing to install again on tools-precise-dev and I forgot what the issue was [19:06:57] yeah, tools.* is also not working, so it's something else that's wrong [19:09:00] ahahahah [19:09:09] SMTP error from remote mail server after end of data: host ASPMX.L.GOOGLE.COM [74.125.22.26]: 550-5.7.1 [208.80.155.188 11] Our system has detected that this message is\n550-5.7.1 not RFC 2822 compliant. To reduce the amount of spam sent to Gmail,\n550-5.7.1 this message has been blocked. Please review\n550 5.7.1 RFC 2822 specifications for more [19:09:10] information. w140si7312939qha.15 - gsmtp [19:09:18] that's what I get for manually SMTP'ing [19:10:16] Coren: did you get the email with subjet 'test remote' and content 'blah'? [19:10:26] exim suggests you should have [19:13:15] valhallasw: I did, it was spambinned. [19:13:23] Great. [19:13:38] 10Tool-Labs: Provision and test tools-mailrelay-01 - https://phabricator.wikimedia.org/T97574#1277149 (10valhallasw) >>! In T97574#1277070, @valhallasw wrote: > [ ] delivering remote mail to tools.admin@tools.wmflabs.org ** still failing ** ``` 2015-05-11 18:56:38 1YrssK-00070S-GK ** valhallasw@arctus.nl Coren, can you switch the mx record? [19:14:44] I can. how confident are you that this won't blackhole spam? [19:14:49] s/spam/email/ [19:15:04] I'm pretty sure this issue was because of me manually smtp'ing using telnet [19:15:14] the mail I sent locally with mail worked fine [19:15:27] `mail` [19:16:40] mx and spf change will take some time to kick in because caching (and just as long to return if we switch back). No more testing to do first? [19:17:36] see https://phabricator.wikimedia.org/T97574#1277010 for what I tested; I'm not sure what more I should test than those options [19:17:47] I can try an actual email client via smtp [19:18:58] No, it's allright; I was just making sure you were ready to go live. Do you want me to add as mx or make it the only one? 
[19:20:08] Wait, more important question - are you expecting mail outgoing from the current relay at all? [19:20:32] (i.e.: are all queues drained, nothing pointing at it)? Because if I remove it as mx spf will fail for it. [19:20:45] thank you for pinging me :p [19:20:50] (np though) [19:21:09] Coren: I'd keep the spf record for now [19:21:35] SPF|Cloud: Honestly, you shouldn't have irc ping you on a common tla while idling on technical channels. :-) [19:21:38] Coren: the queues are not empty, and the servers have not switched over internally either [19:21:46] (as in: the other hosts still use tools-mail atm) [19:21:55] valhallasw: That means I must not remove the mx for it. :-) I'll add yours. [19:22:29] Coren: but then there could be incoming mail there -- that's a bit of a catch-22 [19:22:44] Coren: yeah, I know (I'm in ~20 of those technical channels) :) but it's okay, just say whatever you need to say [19:22:47] valhallasw: It's also a safety net. :-) [19:22:51] *nod* [19:23:21] valhallasw: But also, ldap has decided it no longer wants me to save the record because it has a txtrecord entry. [19:23:28] Coren: so: add MX record, switch over internally, wait for a week or so, see if/why queues haven't flushed, remove MX record? [19:23:41] * Coren says really really evil things about opendj [19:23:56] valhallasw: That sounds like a reasonable plan. [19:24:05] Brilliant. Let's do that :-) [19:24:20] PROBLEM - Puppet failure on tools-static-02 is CRITICAL 25.00% of data above the critical threshold [0.0] [19:24:22] * Coren strangles opendj. [19:25:28] Where can I read about jsub, jstart? Are they standard Linux things with man pages I could read, or are they special Tool Labs stuff? 
[19:25:51] harej: man jsub should work, I think [19:25:56] harej: 'man jsub' and 'man jstart'; but they are just wrappers around qsub [19:25:59] * valhallasw strangles puppet + cdnjs [19:26:06] harej: also https://wikitech.wikimedia.org/wiki/Help:Tool_Labs [19:26:16] valhallasw: it should recover on next run [19:26:30] Notice: /Stage[main]/Toollabs::Static/Git::Clone[cdnjs]/Exec[git_pull_cdnjs]/returns: Auto-merging build/packages.json.js [19:26:30] Notice: /Stage[main]/Toollabs::Static/Git::Clone[cdnjs]/Exec[git_pull_cdnjs]/returns: CONFLICT (content): Merge conflict in build/packages.json.js [19:26:31] Notice: /Stage[main]/Toollabs::Static/Git::Clone[cdnjs]/Exec[git_pull_cdnjs]/returns: Automatic merge failed; fix conflicts and then commit the result. [19:26:32] are you sure? [19:26:39] wait what, merge conflict [19:26:40] wtf [19:26:43] yeah. [19:26:43] did they force push [19:26:49] let me take a look [19:27:33] yuvipanda: also, why is it git pulling from a puppet manifest :P [19:27:47] valhallasw: so if it fails we get an alert :) [19:27:48] like ^ [19:27:59] ensure => latest [19:28:00] yuvipanda: Repeat fyi/reminder since you're now visibly active: I'm travelling tomorrow then working at UTC+2 until the Hackathon. I'll give you my mobile there once I get the new sim. [19:28:33] Coren: cool :) mark wants us to start doing the sprints stuff again, but I guess we need to do it slightly differently. I’ll email shortly. [19:29:03] yuvipanda: I was wondering why we don't .debify it, but 11G. Fine :D [19:29:09] valhallasw: :P [19:29:14] valhallasw: git itself struggles with that. [19:29:42] yuvipanda: so "jstart [thing to execute]" is sufficient for my crontab? [19:30:05] valhallasw: I... have no idea how to fix the stupid thing with opendj. I can't save the dumb record because opendj has decided that the record it *already* has is not acceptable to itself. 
[19:30:08] yuvipanda: yeah, they force pushed [19:30:32] valhallasw: dicks [19:30:33] let me reset [19:30:45] valhallasw: I'll need to dig into how opendj handles its schemas to debug this. [19:30:56] yuvipanda: https://github.com/cdnjs/cdnjs/commit/7a790e9fc74e9d9eb0e9cabe5febb0015e459d41 vs https://github.com/cdnjs/cdnjs/commit/ab3eb438ae72c0de0a24e4d06241c31cb16fdb79 [19:31:06] yuvipanda: wait, let me diff first [19:32:10] https://github.com/cdnjs/cdnjs/compare/7a790e9fc74e9d9eb0e9cabe5febb0015e459d41...ab3eb438ae72c0de0a24e4d06241c31cb16fdb79 [19:32:46] valhallasw: the diff’s hilarious [19:33:01] valhallasw: I mean [19:33:02] git diff [19:33:03] on the host [19:33:14] https://www.irccloud.com/pastebin/cU6iSItK [19:33:28] wat. [19:33:34] that's from what to what? [19:34:12] valhallasw: that’s just running a git diff [19:34:17] I did a git pull [19:34:19] and that failed [19:34:27] because apparently there were uncommited changes [19:34:28] that's what it wants you to manually merge...? [19:34:30] and this was the uncommited change [19:34:31] wat. [19:34:34] yeah [19:34:40] but they also force pushed [19:34:43] so... dunno. [19:35:21] valhallasw: I am just going to reset. it’s probably also git being flaky on a 11G repository [19:35:39] ... 11G repo? [19:35:43] anyway, should be fine to reset. Nothing malicious inserted [19:37:31] Coren: mm, first remove MX record then shut down exim on tools-mail or the other way around? If we do it the other way around, there won't be outgoing mail and incoming mail should automatically be diverted to tools-mailrelay-01 [19:37:31] Coren: https://github.com/cdnjs/cdnjs [19:38:37] valhallasw: I can't touch the ldap record /at all/ as things now stands because opendj is throwing a fit. The best you could do if you absolutely want to do this now is to move the mail.tools.wmflabs.org name to your ip. [19:38:50] Coren: No hurry, just preparing the plan. 
[19:38:56] valhallasw: done (re: reset) [19:39:15] mainly wondering whether the spf issue is best solved by first shutting exim down, or by just not caring too much about it [19:39:43] valhallasw: By not caring about it until the queues are drained and just turning it off. Ima put yours at higher priority anyways. [19:42:29] yuvipanda: Also fyi, I've upped the I/O priority of the backup a bit. Hasn't affected the curves much but still wanted you to be aware just in case. [19:43:52] Coren: ah ok! ETA? [19:44:57] *knock on wood* Should be another 20-30h I'm guessing. It's currently in the middle of the worst part (tiles, millions of small files). But its speed is varying a lot depending on who else is using nfs so it's really unpredictable. [19:46:27] 10Tool-Labs: Provision and test tools-mailrelay-01 - https://phabricator.wikimedia.org/T97574#1277222 (10scfc) NB: As shown by your test mail above, changing IPs will make other hosts reject mails as spam! (Exclamation marks used intentionally :-).) The current mail relay puppetry will announce itself to the w... [19:48:59] Coren: right. [19:49:20] RECOVERY - Puppet failure on tools-static-02 is OK Less than 1.00% above the threshold [0.0] [19:51:54] fsck. [19:52:00] okay, more mailrelay stuff for a later date [19:53:08] Coren: so we should also postpone adding the MX record [19:53:16] until that hostname issue is also resolved [19:53:50] That's the same issue. The mx and the spf all live in the very soa record ldap refuses to let me edit. [19:54:18] * Coren is poring over documentation on how to convince opendj to be sane. [19:55:16] Coren: I think scfc refers to how exim identifies itself over smtp, which should be the actual host name [19:55:22] i.e. the EHLO [19:55:53] * valhallasw thinks scfc should be on IRC more often =p [19:56:00] That's immaterial so long as whichever hostname is an mx. :-) [19:56:05] Hence: same problem. :-) [19:56:18] Yeah, I haven't seen scfc on irc in ages. 
[19:56:38] I asked [19:56:46] he said he gets a lot more work done when not on IRC :) [19:56:55] and that it also forces others to deal with him via phab tickets [19:56:57] Coren: I don't see how it's the same problem. If we switch off tools-mail, that means we suddenly get stuff blackholed, then? [19:56:58] which is a fair enough argument [19:57:34] we could move tools-ops discussion into conpherence, I guess [19:57:55] nooooo [19:58:02] ;-) [19:58:25] valhallasw: No; you can have more than one mx and they can all be dead but one. [19:58:58] Coren: wait, let me recap. When SMTPing to another host, both exims identify themselves as mail.tools.wmflabs.org [19:59:12] 6Labs, 7Icinga: https://icinga.wmflabs.org/ down - https://phabricator.wikimedia.org/T98765#1277248 (10scfc) [19:59:14] 6Labs, 10Labs-Infrastructure: Labs: both Icinga and Ganglia not accessible: 502 Bad Gateway - https://phabricator.wikimedia.org/T85318#1277249 (10scfc) [19:59:16] ... wait. Why did you get your relay to lie? [19:59:26] Coren: that's because it's hardcoded in the config [19:59:30] primary_hostname = mail.MAILDOMAIN [19:59:48] valhallasw: Heh. That needs to be unhardconfiged. :-) [20:00:18] The primary_hostname should be... the primary hostname. :-) [20:00:18] is there a way to get the external hostname in puppet? [20:00:30] 6Labs, 10Labs-Infrastructure: Labs: both Icinga and Ganglia not accessible: 502 Bad Gateway - https://phabricator.wikimedia.org/T85318#943931 (10scfc) The DNS record for `icinga.wmflabs.org` has resurrected: ``` [tim@passepartout ~]$ host icinga.wmflabs.org icinga.wmflabs.org has address 208.80.155.156 [tim@p... [20:00:38] valhallasw: No; you'll have to stuff it in puppet or hiera. [20:00:53] * valhallasw thinks [20:03:09] I'm going to leave that for now. 
[20:03:55] Coren: btw, I wrote this small script to help run ssh commands on a multitude of tools hosts by type: https://github.com/yuvipanda/personal-wiki/blob/master/tools-dsh-generator.py [20:04:18] 6Labs, 10Labs-Infrastructure: Labs: both Icinga and Ganglia not accessible: 502 Bad Gateway - https://phabricator.wikimedia.org/T85318#1277257 (10yuvipanda) Who keeps doing that... [20:05:19] 6Labs, 10Labs-Infrastructure: Labs: both Icinga and Ganglia not accessible: 502 Bad Gateway - https://phabricator.wikimedia.org/T85318#1277260 (10yuvipanda) Deleted. [20:05:48] yuvipanda: GIVE US REDIRECTS [20:05:48] :P [20:05:55] because I can't ever figure out what the right names of those tools are [20:06:08] keeping an instance up just for redirects sounds eh. [20:06:13] but I guess we can. should do... [20:06:42] we have an instance somewhere for TS redirects as well, right? [20:06:54] yeah, this could point to that [20:06:59] it’s not in a repo anywhere, tho [20:06:59] relic.eqiad.wmflabs [20:07:16] https://wikitech.wikimedia.org/wiki/Nova_Resource:I-0000075d.eqiad.wmflabs [20:07:25] Coren: btw, https://github.com/yuvipanda/personal-wiki/blob/master/tools-dsh-generator.py is a helper script I wrote to make using pssh / dsh easier with tools, because fuck salt. [20:07:27] it's explicitly in toolserver-legacy, though [20:07:46] so that's maybe not the sensible project [20:07:47] oh well [20:08:07] yuvipanda: I dunno about having sex with salt - seems painful. But I had a couple of similar scripts myself. :-) [20:08:13] :) [20:19:44] 10Tool-Labs: Provision and test tools-mailrelay-01 - https://phabricator.wikimedia.org/T97574#1277295 (10coren) The mail relay, whatever its name, should most certainly not be lying about it. Both names will be MXes, and the spf is set to mx -all so that'll just work. If nobody lies. :-) [20:19:48] test [20:21:13] saki: Your test clearly failed. [21:14:48] Coren: you around? 
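Editorial sketch, not part of the original log: the mail relay discussion above hinges on the SPF policy "mx -all". Under that policy a sending IP passes only if it belongs to one of the domain's MX hosts, and everything else fails. That is why removing the old relay's MX record while it still sends mail would make its messages fail SPF. The host names and addresses below are hypothetical documentation values, not the real Tool Labs records, and the MX lookups are assumed to be already resolved.

```python
from ipaddress import ip_address

# Hypothetical, already-resolved MX hosts (documentation addresses, not real records).
MX_HOSTS = {
    "old-relay.example": ["192.0.2.10"],
    "new-relay.example": ["192.0.2.20"],
}


def spf_mx_all(sender_ip, mx_hosts):
    """Evaluate the policy 'v=spf1 mx -all' for a connecting IP address."""
    sender = ip_address(sender_ip)
    for addresses in mx_hosts.values():
        # The 'mx' mechanism matches if the sender is any address of any MX host.
        if any(ip_address(addr) == sender for addr in addresses):
            return "pass"
    # '-all' means everything that did not match is a hard fail.
    return "fail"


# While both relays are listed as MX, mail from either one passes.
print(spf_mx_all("192.0.2.10", MX_HOSTS))
print(spf_mx_all("192.0.2.20", MX_HOSTS))

# After dropping the old relay from the MX set, its outgoing mail fails SPF.
only_new = {host: ips for host, ips in MX_HOSTS.items() if host != "old-relay.example"}
print(spf_mx_all("192.0.2.10", only_new))
```

This matches the plan agreed in the channel: add the new MX first, switch hosts over internally, and only remove the old MX once its queues have drained.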
[21:24:13] 10Tool-Labs, 5Patch-For-Review: Multiple queue runners on tools-mail - https://phabricator.wikimedia.org/T74867#1277579 (10valhallasw) @scfc: do you have time to look at the patch in the near future? I'm thinking it might be a better idea to merge it before continuing the migration to -mailrelay-01, which als... [21:27:03] yuvipanda: can you try reproducing https://phabricator.wikimedia.org/T98586? [21:27:11] on phab-02 [21:27:15] er -01 [21:43:48] 10Tool-Labs: Grants for my Tools-db missing to insert new lines - https://phabricator.wikimedia.org/T98790#1277663 (10Kolossos) 3NEW [21:50:29] How is it possible, that a job sent to the job queue is terminated after approx. one hour with exit code 130? This is "Script terminated by Control-C", but there is obviously no ctrl-c on the exit nodes... [21:53:18] apper: could it be interpreting a HUP as ^C? [21:55:30] Negative24: how could this happen? I've running these jobs (it's more than one job) for more than a year and never had these problems... Who sends HUP? It's basically a php script, so I don't know how this is handled [21:55:50] no idea [22:04:26] apper: out of memory? Check with qacct -j [22:05:45] And sorry, but can't help you further right now; I need to go to bed. [22:07:36] 6Labs, 10Beta-Cluster, 5Patch-For-Review: Move logs off NFS on beta - https://phabricator.wikimedia.org/T98289#1277733 (10bd808) Sort of related, the Monolog config for beta was messed up making most things go to the default `wfDebug` udp2log file and not to Logstash at all. This was fixed with {T98772} by m... [22:10:24] is there a better way for qacct? It starts with all runs since the beginning of 2014... and if I use the -b parameter to have a start date, it only outputs current runs, but needs the same time (I think about 10 minutes)... [22:13:22] and maxvmem is lower than the -mem parameter for jsub [22:13:55] I will see if bigbrother works well and go to bed now... 
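Editorial sketch, not part of the original log: the exit code 130 discussed above follows the common shell convention that a status above 128 means "128 plus the signal number", so 130 corresponds to signal 2 (SIGINT) and a hangup would show up as 129 (SIGHUP). The helper below decodes a status that way. It assumes the job's exit status follows this shell convention, which may not hold for every way Grid Engine records a failure, and `describe_exit_status` is a made-up name.

```python
import signal


def describe_exit_status(status):
    """Explain a shell-style exit status, assuming the 128 + signal convention."""
    if status == 0:
        return "success"
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"unknown signal {signum}"
        return f"terminated by {name} (128 + {signum})"
    return f"exited with error code {status}"


for code in (0, 1, 129, 130, 137):
    print(code, describe_exit_status(code))
```

Decoding the status does not say who sent the signal; as suggested in the channel, comparing maxvmem and the other fields in the `qacct -j` output against the job's limits is still the way to find the cause.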
[22:21:56] Tool-Labs: Clean up huge logs on toollabs - https://phabricator.wikimedia.org/T98652#1277783 (yuvipanda) NFS backup still going on, will be at least 3 days before it's done, I think.
[22:36:54] Labs, Beta-Cluster, Patch-For-Review: Move logs off NFS on beta - https://phabricator.wikimedia.org/T98289#1277870 (bd808) I removed `role::logging::mediawiki` from deployment-bastion and tried to manually clean up the services and config that were orphaned by that role being removed. There are certain...
[22:37:48] Labs, Wikimedia-Extension-setup, wikitech.wikimedia.org, Patch-For-Review: Re-enable OAuth in Wikitech - https://phabricator.wikimedia.org/T98567#1277871 (yuvipanda) I suppose we could just delete those tables and start afresh.
[22:50:54] Labs, Wikimedia-Extension-setup, wikitech.wikimedia.org, Patch-For-Review: Re-enable OAuth in Wikitech - https://phabricator.wikimedia.org/T98567#1277893 (Krenair) Yeah... Nothing particularly useful here: ```MariaDB [labswiki]> select count(*) from oauth_accepted_consumer; +----------+ | count(*)...
[22:52:21] Labs, Wikimedia-Extension-setup, wikitech.wikimedia.org, Patch-For-Review: Re-enable OAuth in Wikitech - https://phabricator.wikimedia.org/T98567#1277897 (Legoktm) Why do we want OAuth on wikitech again?
[22:53:07] Labs, Wikimedia-Extension-setup, wikitech.wikimedia.org, Patch-For-Review: Re-enable OAuth in Wikitech - https://phabricator.wikimedia.org/T98567#1277898 (yuvipanda) It's LDAP authentication by proxy. You shouldn't ask for LDAP credentials on Labs instances / Tools projects, so this is the way to d...
[22:54:52] yuvipanda: what's the usecase?
[22:55:28] legoktm: shinken.wmflabs.org needs LDAP auth, for example. and in the future, we want logstash for toollabs and have it allow only people who are admins access to the web interface.
[22:56:18] yuvipanda: enabling OAuth for ldap access seems like a rather complicated solution to this problem...
[22:56:44] all the solutions are complicated
[22:57:02] because we want to exchange credentials outside of labs
[22:57:12] s/labs/labs user space/
[22:57:34] legoktm: alternatives welcome?
[22:57:51] I was just starting to work with Ryan on a solution for the problem when he found a job he liked better
[22:58:23] hehe
[22:58:52] PROBLEM - Puppet failure on tools-webgrid-generic-1401 is CRITICAL 66.67% of data above the critical threshold [0.0]
[23:04:23] legoktm: bd808 I think OAuth with wikitech is a reasonable solution for now.
[23:04:33] or at least, *way* better than rolling our own :)
[23:05:50] yuvipanda: I'm mainly concerned that no one is testing OAuth integration with LDAPAuthentication, and I don't think we want random apps authenticating against wikitech instead of SUL.
[23:06:10] legoktm: we’ll probably be just a lot more cautious about granting perms.
[23:11:04] Labs, operations, wikitech.wikimedia.org: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1277913 (Krenair)
[23:12:04] yuvipanda, what happened here? https://phabricator.wikimedia.org/project/profile/832/#6621
[23:12:26] Krenair: uhm, no idea.
[23:12:35] Krenair: chasemp ^ do you know?
[23:12:41] twentyafterfour: ^
[23:12:55] they made it joinable by admin/ops only
[23:13:28] the members list is a bit weird too
[23:13:35] who is Gryllida?
[23:13:36] at least one person is not an admin/nda/ops/etc.
[23:13:40] that was a manual change, not sure who did that though since it was done with the "admin" account
[23:13:49] I've seen them on wikimedia mailing lists, but..
[23:17:50] yuvipanda: gry? aka svetlana?
[23:17:58] possibly
[23:18:04] no idea why they are part of a locked group
[23:20:56] I'm more interested in knowing why that group is locked
[23:21:40] although that is a helpful indicator that whoever locked it did not really think it through
[23:22:04] I'd ask chase.
I'm not saying he made the change but I don't think hardly anyone has access to that "admin" user account
[23:23:53] RECOVERY - Puppet failure on tools-webgrid-generic-1401 is OK Less than 1.00% above the threshold [0.0]
[23:44:15] legoktm: can you add tox check to operations/software/tools-webservice when you have the time?
[23:45:10] yuvipanda: file a bug under CI-Config and assign it to me?
[23:45:25] legoktm: sure