[03:52:02] petan: I forgot to make a /paste/ temporary, is there an easy way to override a stupidity? doesn't matter if there isn't [04:19:04] sDrewth: What's the ID? [04:31:52] !log tools "iptables -A OUTPUT -d 10.68.16.1 -p udp -m udp --dport 53" on all hosts in support of bug #70076 [04:31:55] Logged the message, Master [04:33:45] Wikimedia Labs / Infrastructure: Internal DNS look-ups fail every once in a while - https://bugzilla.wikimedia.org/70076#c4 (Tim Landscheidt) I ran "iptables -A OUTPUT -d 10.68.16.1 -p udp -m udp --dport 53" on all Tools instances to get an idea of the order of magnitude between tools-master, tools-web... [04:49:16] urk, too slow [04:54:28] which? [08:48:32] Why is https://tools.wmflabs.org/wikidata-terminator/ a 404? Is that a temporary thing? [08:51:14] filed as https://bitbucket.org/magnusmanske/wikidata-todo/issue/3/wikidata-terminator-404 [10:08:04] lest anyone forget, the dns periodically goes kaput issue is still real. using my cronspam as a metric [10:08:15] hosts that can't resolve their own names [10:25:40] is it tracked anywhere? [10:27:11] good question [10:29:10] godog, sounds like https://bugzilla.wikimedia.org/70076 [10:35:50] jeremyb: yep see comment #3, likely that [11:31:19] Ima doing a real DNS server for labs this week. [11:31:51] The root issue is "dnsmasq is the worst piece of shit trying to pass as a DNS server" [11:48:42] hah [12:54:00] <_joe_> FYI, I'm going to shortly deploy an apache change to beta [12:54:38] <_joe_> I'll probably roll it back anyways, but if you notice something not working in beta in the next half hour, please let me know [13:22:58] * Coren hates dnsmasq even more than before now. [13:23:12] dnsmasq also answers DHCP. That's part of its job. [13:24:06] What's hateful: it always responds domain-name-servers pointing to itself. This does not appear to be configurable. 
[13:57:04] Wikimedia Labs / Infrastructure: Internal DNS look-ups fail every once in a while - https://bugzilla.wikimedia.org/70076#c5 (Marc A. Pelletier) a:Marc A. Pelletier Replacing dnsmasq is... more complicated than reasonable because of the way it's being invoked and managed by Openstack. As a first pas... [14:08:50] Coren, aren't we committed to dnsmasq since it's integrated with nova-network? Or is that configurable? [14:09:27] I was hoping to /add/ a dns server in front of it; but dnsmasq also forcibly points clients at itself. :-( [14:09:41] https://gerrit.wikimedia.org/r/#/c/157816/1 might help a lot though [15:27:09] !log deployment-prep https://en.wikipedia.beta.wmflabs.org/ returning ERR_CONNECTION_REFUSED (is varnish down?) [15:27:13] Logged the message, Master [15:29:31] <_joe_> bd808: uh? I might have deployed a varnish change today [15:29:33] !log deployment-prep `curl -vL -H 'Host: en.wikipedia.beta.wmflabs.org' localhost` works from deployment-cache-text02 [15:29:36] Logged the message, Master [15:30:03] _joe_: Seems to work from directly on varnish box, but not outside world [15:30:39] <_joe_> oh, well, this cannot be related to my change [15:30:51] <_joe_> bd808: the wmflabs.org proxy is down, maybe? [15:31:22] <_joe_> bd808: works for me... [15:31:42] <_joe_> I mean I can reach beta just fine [15:32:32] weird. Routing problem? I can't load it from an incognito browser session and gi11es reports the same issue [15:32:56] * bd808 runs a treceroute [15:33:07] *traceroute [15:33:42] <_joe_> bd808: let me try again [15:34:06] http://pastebin.com/ScbbRyq8 [15:34:07] <_joe_> it works just fine [15:34:13] _joe_: It was the URL. gi11es was using https which is borked in beta [15:34:23] ah, I keep forgetting about that [15:34:24] <_joe_> and I'm on one of the crappiest mobile 3G networks in the world [15:34:33] <_joe_> oh ok [15:34:36] all those urls are in my browser's autocomplete [15:34:40] !log deployment-prep False alarm. 
SSL is borked in beta and we know that [15:34:43] Logged the message, Master [15:34:51] <_joe_> you know, I deployed the same change in prod 40 minutes ago [15:35:02] <_joe_> I was about to die in terror :) [15:35:26] my bad [15:35:33] Meh. We will break the wikis in some other way today :) [15:35:52] The ssl thing is annoying [15:36:18] We should just configure a self-signed cert and be done with it [15:37:16] https://bugzilla.wikimedia.org/show_bug.cgi?id=68387 [15:38:47] Wikimedia Labs / deployment-prep (beta): beta labs no longer listens for HTTPS - https://bugzilla.wikimedia.org/68387 (Bryan Davis) [15:41:44] Coren: What database should I put CORE metadata into? (~14GB) [15:42:03] "CORE"? [15:42:14] Coren: The open access journal thing. [15:42:43] a930913: Well, unless you expect that joins between it and project databases are likely, tools-db is the right spot. [15:43:14] (aka tools.labsdb) [15:43:48] Coren: Thanks. [15:44:13] Coren: Who was it who was doing citoid, that I talked to the other day? [15:44:32] ... sorry, I can't recall offhand. [15:44:42] Began with a J. [15:44:54] James_F|Away ? [15:45:08] I recall him talking about it. [15:45:13] Hmm, might have been. Thanks anyway. [15:58:54] Coren: Where are the docs for the database naming conventions and all? [16:00:52] You mean https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Creating_new_databases ? [16:03:10] Coren: That's the one :) [16:06:38] a930913: Yes, I'm the one to talk about citoid with. [16:35:48] Wikimedia Labs / wikitech-interface: Can't reset password on wikitech (Unicode passwords not accepted) - https://bugzilla.wikimedia.org/56114#c3 (Chris Steipp) Wikitech is the only wiki (iirc) backed by ldap authentication. I'm going to guess either the extension or the ldap server doesn't handle the n... [16:38:08] James_F: Would CORE metadata on tap be of interest to you? [16:39:13] a930913: CORE? [16:39:24] a930913: You mean Dublin Core or something else? 
[16:39:50] James_F: The open access stuff. [16:56:27] Coren: Should I use qlogin, instead of burning a core on tools-login? [16:57:12] Wikimedia Labs / wikitech-interface: Can't reset password on wikitech (Unicode passwords not accepted) - https://bugzilla.wikimedia.org/56114#c4 (jeremyb) (In reply to Chris Steipp from comment #3) > Does anyone in ops know what ldap server is used? See e.g. bug 63717. which indicates opendj [16:57:30] a930913: why not just jsub the thing? Otherwise, you can do it on tools-dev [16:58:16] Coren: Because I'm bashing stuff out :p tools-dev is a good idea. [16:58:25] I stopped using that after migration. [16:58:42] Wikimedia Labs / wikitech-interface: Can't reset password on wikitech (Unicode passwords not accepted) - https://bugzilla.wikimedia.org/56114#c5 (Marc A. Pelletier) It's opendj. [17:12:53] andrewbogott: Turning on nscd hosts caching has reduced the traffic to dnsmasq by an order of magnitude; most of what's left is external now. [17:13:03] great! [17:13:30] Good idea :) [17:20:33] And external, we /could/ slave to the real servers. In fact, we should. [17:20:44] (And have /them/ be NS) [17:21:27] (Having the labs dns break because someone outside the network decides to hammer lightly on it remains... suboptimal). [17:25:02] Coren: which real servers are you referring to above? [17:26:52] bblack: ns[0-2].wm.o Right now, external requests for wmflabs.org go to labs-ns[01].wm.o which is a poor little shitty dnsmasq; having ns[0-2] cache for it would protect it (especially since 99% of the requests are for the same half-dozen A records) [17:27:32] ns[0-2].wm.o don't have any ability to cache, or to do zone transfers in either direction [17:27:40] Ah, bah. [17:28:44] we could fake it out with a powerdns cache sitting in front of dnsmasq, maybe. it would need to pretend it was authoritative when it wasn't, though. [17:28:44] I'm not sure if it has config to do that, but maybe. 
[17:29:57] Wikimedia Labs / wikitech-interface: Can't reset password on wikitech (Unicode passwords not accepted) - https://bugzilla.wikimedia.org/56114 (Tim Landscheidt) a:Ryan Lane>None [17:30:27] bblack: That'd work. I know unbound can do that, but I'd rather not introduce a new dns server if we can avoid it. [17:35:46] a930913: Possibly. [17:36:15] andrewbogott: the topic would not explain bastion servers being unresponsive from time to time, would it? [17:36:40] gifti2: nope! [17:36:40] Wikimedia Labs / tools: Create postgresql user databases on request - https://bugzilla.wikimedia.org/63382#c20 (Tim Landscheidt) PATC>RESO/FIX Having requests as additional comments on an eternal bug seems suboptimal. Let's close this one and create new ones for new requests. After the move awa... [17:36:40] Although, Coren may have just fixed that, if it was a DNS issue [17:37:15] tools-login says "-bash: fork: Cannot allocate memory". [17:37:38] Well, that's definitive... [17:37:42] I guess dns would only cause issues when connecting and not while connected? [17:37:49] gifti2: right. [17:37:53] Which bastions are troubling you? [17:38:07] bastion2 atm [17:38:33] Wikimedia Labs / tools: Add "--help" parameter to "become" - https://bugzilla.wikimedia.org/62710 (Tim Landscheidt) PATC>RESO/FIX a:Marc A. Pelletier>Tim Landscheidt [17:39:15] it works this very sec though [17:40:55] gifti: All I would do in that case (which you can do as well) is run 'top' and see if the box is super busy [17:41:04] Which, if it is, it means someone is running something inappropriate there. [17:41:13] hmm [17:42:03] andrewbogott: Not knowing if that's the case here, but in the past I thought about setting a per-user resource limit that's enough for some ssh tunnels, but not more (on the Labs bastions, that is). 
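A per-user cap like the one scfc_de describes is normally applied at login via pam_limits (/etc/security/limits.conf); the kernel mechanism underneath is rlimits. A minimal sketch of that mechanism using Python's resource module follows (the limit value 64 is an arbitrary example for illustration, not a proposed bastion setting):

```python
import resource

# Current per-process cap on open file descriptors: (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Lower the soft limit for this process. pam_limits performs the same kind
# of setrlimit() call per user at login; a low enough cap leaves room for
# a few ssh tunnels but stops runaway processes on a shared bastion.
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))

# Attempts to open a 65th descriptor in this process would now fail
# with EMFILE; the soft limit can still be raised back up to `hard`.
```

The soft/hard split matters here: an unprivileged user can lower or restore their own soft limit, but only root can raise the hard limit.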
[17:42:06] (PS1) Ynhockey: Fixing Ukraine commonscat issue [labs/tools/heritage] - https://gerrit.wikimedia.org/r/157846 [17:43:30] bblack: Then again, if we're setting up a powerdns cache it might be simpler to just set it up as authoritative slave. [17:43:56] (PS4) Tim Landscheidt: DO NOT COMMIT: Test linters [labs/toollabs] - https://gerrit.wikimedia.org/r/153625 [17:44:10] ... provided dnsmasq even /implements/ zone transfers which, honestly, I wouldn't bet on. [17:47:44] Even my "top" won't start at tools-login. *argl* [17:48:06] scfc_de: Huh? WFM and the load is light. [17:49:50] "top" just stalls (with no output), and even "uptime" returns immediately without any output. [17:50:33] (And if that was an artifact of cancelling top, a new login stalls again.) [17:50:42] Okay, now it works. Heisenbugs. [17:51:20] scfc_de: Might be networky on your side; because everything is fast and snappy from here. [17:51:35] * Coren notes with displeasure that there are new bots running on -login. Again. [17:51:35] tools login and def very low :/ [17:51:59] Coren: "-bash: fork: Cannot allocate memory" wasn't on my side :-). [17:52:24] scfc_de: ... what? You got that on -login? [17:53:18] Yep. But my search-karma is bad to see in syslog if any processes were killed in the past few minutes. [17:53:35] scfc_de: Ohwait; are you going through a bastion? Because /they/ have been having memory issues. [17:53:49] andrewbogott: re: connection issues, researchers in -research are also complaining of network ssh issues to bastion in prod [17:53:49] halfak: ^ [17:53:49] scfc_de: And no, I just checked -- no OOM killers in the logs. [17:54:07] scfc_de: And ram usage hovers around 55% [17:54:42] Coren: Not for tools-login.wmflabs.org. [17:55:25] beta labs seems really slow atm [17:55:29] hey, i pushed a change for the first time ever and i think it needs to be reviewed. does every change need to be reviewed, even for non-WMF projects? 
(the project is called tools/heritage) [17:55:32] Coren: And the fork message appeared between "Last login: " and "scfc@tools-login:~$", so definitely from tools-login. [17:55:37] Coren: http://ganglia.wikimedia.org/latest/graph_all_periods.php?me=Wikimedia&m=cpu_report&r=2hr&s=by%20name&hc=4&mc=2&g=network_report&z=large also says all of wikimedia had a spike of several *petabytes* in the last few minutes? [17:56:09] Ynhockey: That depends on the repository. What's the URL for the change? [17:56:35] yuvipanda: That's downright *insane* [17:56:35] https://gerrit.wikimedia.org/r/157846 [17:56:35] Coren: indeed [17:57:27] Coren: and all from labs [17:57:28] aha [17:57:31] Coren: http://ganglia.wikimedia.org/latest/?c=Virtualization%20cluster%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [17:57:41] Coren: 500 petabytes a sec?! [17:57:41] wtf [17:58:43] yuvipanda: There isn't that much fiber to go around. [17:58:43] Coren: indeed [17:58:43] although I suppose it isn't 'per second' but something larger, time unit wise [17:58:43] still [17:58:43] (And no, those'd be petaBITS/s. Still crazy insane) [18:00:41] yuvipanda: Seriously though, I'm pretty sure it's the instrumentation that went boom. [18:00:46] that 500 PB/s has to be a measurement fluke, yes [18:00:53] heh [18:01:02] or we might've seen singularity [18:01:03] memory only does tens of GB/s [18:01:22] this is 7 orders of magnitude more [18:02:13] so unless WMF has ordered 20 million servers from our donations... ;-) [18:02:47] 4294967296 is 32 bits. I'm guessing some graphite collection got some small negative number of MB/s (because bug) and just overflowed into an unsigned. [18:05:17] a lot of counters like that actually are ongoing, and then the monitoring portion has to make a delta [18:05:44] like bw for cisco stuff always rolls over after a certain time so it can make for odd edge cases like that [18:05:46] Ynhockey: Then you probably need to ask Multichill to okay that. 
Easiest way is to enter "Multichill" in the box and press "Add Reviewer"; you can also ping him in IRC or on wiki if you like. [18:05:51] idk if related but I have had it happen [18:07:04] There was another spike shortly after 18:00Z in Ganglia. [18:08:08] Oh, look. We're doing petabits/s again again. [18:08:15] * Coren tries to figure out wth is going on. [18:09:56] That seems to be correlated with work andrewbogott is doing in and around virt1000 (which is the box that reports the crazy) [18:10:24] Seriously, just changing a vhost is causing that? [18:11:05] andrewbogott: I doubt it. Perhaps puppet is doing evil things to diamond though. [18:11:20] Coren: fwiw, both beta labs and Jenkins seem to be slow for me right now [18:11:48] chrismcmahon: I see no real increase in traffic or load. [18:12:03] At least not globalized. [18:12:45] <^d> gerrit's been slow for me but it tends to lag when labs ldap's slow/unreachable [18:15:35] Coren: ganglia is separate from diamond, I think [18:17:31] Coren: diamond is accurate-r [18:17:31] http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1409681839.02&target=servers.virt1000.network.eth0.tx_byte.value&target=servers.virt1000.network.eth1.tx_byte.value&target=servers.virt1000.network.eth2.tx_byte.value&target=servers.virt1000.network.eth3.tx_byte.value&target=servers.virt1000.network.eth4.tx_byte.value&target=servers.virt1000.ne [18:17:31] twork.eth5.tx_byte.value&target=servers.virt1000.network.eth6.tx_byte.value [18:17:39] stupid stupid graphite [18:18:04] http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1409681839.02&target=servers.virt1000.network.*.tx_byte.value [18:18:36] is better [18:18:36] chasemp: Coren ^ [18:19:02] yuvipanda: Clearly. And those peaks are perfectly coherent with puppet runs. [18:19:11] yeah [18:19:47] Wait, that's not the scale I was thinking of. Hm. Well, that's a real peak right there, but it doesn't match the crazy nor is it currently ongoing. 
[18:19:49] (CR) Multichill: [C: 2 V: 2] "Great, you got git/gerrit working :-)" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/157846 (owner: Ynhockey) [18:25:17] Coren: w00t, labmon.wmflabs.org is receiving data now! \o/ [18:25:25] shall switch it to graphite. in a moment [18:28:52] Coren: I'm about to run puppet again [18:32:12] Wikimedia Labs / tools: Set up lint checks for labs/toollabs - https://bugzilla.wikimedia.org/63687#c5 (Tim Landscheidt) PATC>ASSI (The PHP check is up and running, the JavaScript part needs another round.) [18:36:32] andrewbogott: There's definitely a correlation, that made another peak. How odd. [18:37:26] puppet is skynet [18:37:37] hmm, I suppose it controls enough machines now? [18:37:47] does puppet run ntpd maybe? [18:37:49] has anyone ever audited puppet source? [18:38:08] or does it just make the computer's heart skip a beat? [18:48:10] scfc_de: evening, you online? [18:52:37] Steinsplitter: Yep. [18:56:27] Wikimedia Labs / Infrastructure: Internal DNS look-ups fail every once in a while - https://bugzilla.wikimedia.org/70076#c8 (Tim Landscheidt) PATC>ASSI I reset the counters at 16:15Z because after the merge data from before and after is hard to compare :-). I suggest that we revisit this after T... [19:07:05] Sample: Hi [19:07:12] Sample: anyone can help me? [19:08:27] that was interesting [19:46:46] (Abandoned) Tim Landscheidt: DO NOT COMMIT: Test linters [labs/toollabs] - https://gerrit.wikimedia.org/r/153625 (owner: Tim Landscheidt) [19:47:35] hm, seems to be all fine again ... [20:01:37] !log puppet-compiler why is instance 02 taken offline and the jobs are pending? [20:02:13] !log puppet-compiler puppet-compiler is actually in some other project names .. whatevs [20:02:32] !log no log bot [20:36:58] bd808: can you check if commit a2bbfc7fdd14a04a0cd246b5a73519fb40625609 made it into puppet master on beta? [20:37:37] yuvipanda: You can check. 
deployment-salt /var/lib/git/operations/puppet [20:37:42] ah ok [20:37:43] doing [20:38:52] bd808|deploy: seems to be. I shall debug. thanks for the pointer [20:39:37] yuvipanda: Ha. Now you have been tricked into knowing where puppet lives in beta :) [20:40:04] hehe [20:42:39] bd808|deploy: when you've the time - how do I use salt to force a puppet run on all deployment-prep machines? [20:43:50] `salt '*' -b 2 cmd.run puppet agent -tv` should do it 2 hosts at a time [20:44:20] cool [21:34:25] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Example_configurations is not helpful at all: what's considered root for a tool? [21:36:32] Nemo_bis: I don't get your question. [21:36:36] Nemo_bis: root in what sense? [21:37:25] Nemo_bis: lighttpd gets the full path from the proxy, if that's what you mean, i.e. /projectname/some/path [21:37:40] that doesn't seem to work for me [21:38:52] Nemo_bis: *what* doesn't work. what are you trying. what is the result you expect. what is the result you get. [21:39:18] valhallasw: in my tests for url.redirect, I had to match /some/path (or perhaps some/path) and then redirect to /projectname/whatever [21:40:04] Nemo_bis: that sounds about right [21:40:18] the match is a regexp, so /some/path will also match on /projectname/some/path [21:40:57] Nemo_bis: but matching ^/projectname/some/path$ is the cleaner option, I'd think [21:41:10] I said *had to* [21:41:32] meaning that ^/projectname didn't match [21:41:56] that's surprising [21:42:00] have you tried adding debug.log-request-handling = "enable" ? [21:42:07] no [21:42:33] that will give you lengthy debug info in your error.log [21:45:26] I'm already over budget in terms of time consumed for this redirect [21:45:33] And I'm heading to bed now [21:45:53] But if that section is expanded to cover what's expected behaviour I may test more in the future [22:06:59] bd808: iirc, you mentioned that backends are added to varnish manually for the beta cluster. is that right? 
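valhallasw's point about the url.redirect keys being regular expressions — an unanchored pattern matches anywhere in the path lighttpd receives, which includes the /projectname/ prefix from the proxy — can be demonstrated with Python's re module (Python regex semantics stand in for lighttpd's PCRE here, and the paths are generic examples, not Nemo_bis's actual tool):

```python
import re

# lighttpd behind the tools proxy sees the full path, prefix included.
proxied_path = "/projectname/some/path"

# An unanchored pattern matches anywhere in the string, so matching on
# "/some/path" alone still hits -- which is why the shorter rule worked.
assert re.search(r"/some/path", proxied_path)

# Anchoring the full proxied path is the cleaner, unambiguous option...
assert re.search(r"^/projectname/some/path$", proxied_path)

# ...while anchoring without the prefix no longer matches anything.
assert not re.search(r"^/some/path$", proxied_path)
```

If an anchored ^/projectname pattern really does not match, that suggests lighttpd is seeing a different path than expected, which is exactly what the suggested debug.log-request-handling = "enable" setting would reveal in error.log.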
[22:07:55] marxarelli: Yeah. Let me find the stuff you'll want to change [22:08:44] bd808: i'm looking at wikimedia_text-backend.vcl and text-backend.inc.vcl [22:10:52] bd808: was about to add the new backend but then i saw "List of Puppet generated backends" and thought i'd double check that puppet wasn't going to overwrite my changes [22:11:44] marxarelli: Look at [22:12:32] bd808: oh, roger that [22:14:02] That is where the current servers get "added to the pool" for the varnish configs [22:53:13] Wikimedia Labs / wikistats: New Hive - https://bugzilla.wikimedia.org/70308 (Robert Hanke) NEW p:Unprio s:normal a:None http://fr.rodovid.org/wk/Special:Statistics (22 Wikis) [22:53:59] Wikimedia Labs / wikistats: New Hive - https://bugzilla.wikimedia.org/70309 (Robert Hanke) NEW p:Unprio s:normal a:None http://aero.orain.org/w/api.php Aero Wiki en http://agartha.orain.org/w/api.php Agartha en upload http://air.orain.org/w/api.php Auric Incursion en private http://allthetr... [22:54:58] Wikimedia Labs / wikistats: New Hive - https://bugzilla.wikimedia.org/70308 (Robert Hanke) NEW>ASSI a:Daniel Zahn [22:55:13] Wikimedia Labs / wikistats: New Hive - https://bugzilla.wikimedia.org/70309 (Robert Hanke) NEW>ASSI a:Daniel Zahn [22:55:41] ASSI: isn't ideal when translated to German :) [22:55:47] andre__ ^:) [23:00:10] mutante: https://github.com/valhallasw/pywikibugs/issues/25 :P [23:43:26] Wikimedia Labs / deployment-prep (beta): Setup a mediawiki03 (or what not) on Beta Cluster that we can direct the security scanning work to - https://bugzilla.wikimedia.org/70181#c5 (Dan Duvall) The new deployment-mediawiki03 instance is fully provisioned, and I've cherry picked the varnish patch on de... [23:46:33] bd808: ^ seems to be working right. 
verified the cookie condition w/ tcpdump on mediawiki03 [23:49:29] Wikimedia Labs / tools: Install php5-xdebug on tool labs - https://bugzilla.wikimedia.org/70313 (Kunal Mehta (Legoktm)) NEW p:Unprio s:normal a:Marc A. Pelletier Please install php5-xdebug on tool labs. Thanks! [23:49:41] Wikimedia Labs / deployment-prep (beta): Setup a mediawiki03 (or what not) on Beta Cluster that we can direct the security scanning work to - https://bugzilla.wikimedia.org/70181#c6 (Sherif Mansour) Thanks Dan, will take a look tomorrow and test it, what is the url and domain I should hit? [23:50:43] Wikimedia Labs / deployment-prep (beta): Unable to log in to beta labs on iOS devices (mobile web) - https://bugzilla.wikimedia.org/70145#c5 (Maryana Pinchuk) It's constant now :( And yes, I just repro'ed on desktop Safari, too (good news is desktop Chrome is unaffected). [23:53:56] Wikimedia Labs / deployment-prep (beta): Unable to log in to beta labs using Safari - https://bugzilla.wikimedia.org/70145#c6 (Greg Grossmeier) p:Unprio>Highest Changing summary accordingly. Chris or anyone with a Macbook: can you reproduce? [23:56:57] marxarelli: I see your ping here now. That's awesome. [23:58:13] bd808: it's always satisfying when a puppet apply goes well [23:58:41] bd808: re your comment about failing in prod, i'm not sure how to work around that [23:59:19] That's why I added b.black to the review. I'm not sure I know either. [23:59:36] Maybe send to the normal backends there?