[02:05:30] Wikimedia Labs / Infrastructure: Replica MySQL: Wiki ViewStats databases completely missing! - https://bugzilla.wikimedia.org/71043#c8 (Sean Pringle) This finished loading over the weekend and should be back to normal. Double check?
[02:13:45] Wikimedia Labs / tools: deletion queries joined with tokudb replication tables are really slow - https://bugzilla.wikimedia.org/68918#c3 (Sean Pringle) MariaDB 10.0.14 was released late last week, and the change log says the TokuDB fix for slow DELETE is included. Doing a test build today...
[05:23:44] !log deployment-prep Created deployment-redis01 and converted it to use local puppet & salt masters
[05:23:48] Logged the message, Master
[05:33:57] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools-exec-08.diskspace._var.byte_avail.value (11.11%)
[05:54:46] bd808|BUFFER: morebots was fixed?
[05:55:05] and maybe specify which one next time, there's a lot of them
[05:55:09] labs-morebots
[05:55:09] I am a logbot running on tools-exec-14.
[05:55:09] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[05:55:10] To log a message, type !log <msg>.
[06:35:57] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK
[06:57:46] !log deployment-prep Created deployment-redis02 and converted it to use local puppet & salt masters
[06:57:49] Logged the message, Master
[06:58:01] !log deployment-prep Configured Beta cluster to use redis for session storage
[06:58:04] Logged the message, Master
[08:12:47] Tool Labs tools / [other]: Catscan2 offline - https://bugzilla.wikimedia.org/71402 (Fæ) NEW p:Unprio s:normal a:None Catscan2 appears to not have been available for a week. For me at least, this is a critical tool for housekeeping on Commons which I have no easy replacement for.
[08:39:17] hashar: morning. i switched labs over to use redis for session storage, to match prod
[08:39:28] i provisioned deployment-redis02 for that purpose
[08:40:00] ori: you are awesome
[08:40:12] ori: I noticed a related change for GettingStarted :]
[08:40:28] Matthew added some hack in https://gerrit.wikimedia.org/r/#/c/163547/ :D
[08:40:52] ori: definitely announce your change on the QA list !
[08:41:04] OK, will do
[08:42:47] ori: and please please head to bed after :]
[08:57:16] !log integration rebased puppetmaster
[08:57:19] Logged the message, Master
[12:44:01] Wikimedia Labs / deployment-prep (beta): deployment-mediawiki02 and mediawiki03 have puppet disabled - https://bugzilla.wikimedia.org/71410 (Antoine "hashar" Musso) NEW p:Unprio s:normal a:None The instances deployment-mediawiki02 and deployment-mediawiki03 have puppet disabled. That preve...
[13:32:19] Wikimedia Labs / deployment-prep (beta): hhvm creates core file in /tmp/ filling mediawiki02 labs instance root partition - https://bugzilla.wikimedia.org/69979#c19 (Marc A. Pelletier) While I can think of a number of good reasons why you'd want to keep cores in a development environment, would you eve...
[13:35:00] Wikimedia Labs / tools: Provide namespace IDs and names in the databases similar to toolserver.namespace - https://bugzilla.wikimedia.org/48625#c47 (nosy) Latest plan is to implement a view called toolserverdb that points to the original db. It would regard: s51892_toolserverdb_p.language s51892_tools...
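A minimal connectivity check for the new session-storage hosts created above — a sketch, assuming redis-cli is installed and the instance hostname resolves from inside the project:

    # ping the new Redis instance (hostname taken from the !log entries above)
    redis-cli -h deployment-redis01.eqiad.wmflabs ping
    # a healthy server answers: PONG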
[13:38:59] Wikimedia Labs / deployment-prep (beta): hhvm creates core file in /tmp/ filling mediawiki02 labs instance root partition - https://bugzilla.wikimedia.org/69979#c20 (Ori Livneh) (In reply to Marc A. Pelletier from comment #19) > While I can think of a number of good reasons why you'd want to keep cores...
[13:59:28] yuvipanda35: You have quite a lot of instances with stale puppet -- can you take some time today and catch them up? Complete list here: https://wikitech.wikimedia.org/wiki/Ldap_rename
[14:01:22] Wikimedia Labs / tools: Provide namespace IDs and names in the databases similar to toolserver.namespace - https://bugzilla.wikimedia.org/48625#c48 (Marc A. Pelletier) ASSI>RESO/FIX (The view is named toolserverdb_p to allow all users select right, and it lives on tools.labsdb). That has been cre...
[14:03:59] matanya, _joe_, is the puppet3-diffs project still useful?
[14:04:16] andrewbogott: it is, unless someone broke it
[14:04:29] great -- do you mind rebasing their puppet repos?
[14:04:39] they need the ldap updates at least
[14:04:57] I don't think i have rights for that
[14:05:17] andrewbogott: do you mean pick the changes ?
[14:05:26] matanya: either way, rebase or cherry-pick
[14:05:29] depending on what's appropriate.
[14:05:42] If I'm doing it myself I'll probably just rebase
[14:06:20] andrewbogott: you can: https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/
[14:06:38] you know how ?
[14:07:08] what, integration depends on instances that aren't puppetized? :(
[14:07:26] hashar: ^ ?
[14:07:34] it is a frontend to the compiler
[14:07:54] what is the gerrit change id andrewbogott ?
[14:08:02] andrewbogott: integration is fully puppetized
[14:08:11] and which nodes do you want to test ?
[14:08:26] andrewbogott: but we rely on some instances for Wikidata which are using puppetmaster::self and need a manual rebase. They are managed by wmde
[14:08:41] oh, i see what you requested
[14:08:57] i hate mondays, my brain is broken on them
[14:09:17] hashar: I'm talking about puppet-compiler01 and puppet-compiler02
[14:09:27] andrewbogott: ah those are managed by _joe_ :-]
[14:09:45] and might indeed not be fully puppetized
[14:10:35] * andrewbogott prefers 'managed by puppet' to 'managed by <someone>'
[14:10:52] andrewbogott: where is the puppet tree? i'll rebase
[14:10:58] /var/lib/puppet ?
[14:11:11] Out of ~400 running instances, more than 140 of them had hacked and broken puppet. Something is broken in the labs workflow, clearly :(
[14:11:17] matanya: /var/lib/git/operations/puppet
[14:11:18] should be
[14:11:49] matanya: thanks!
[14:12:11] I'm hoping that the proper function of those instances doesn't rely on them having puppet disabled...
[14:12:15] sure, what revision should I rebase to ?
[14:12:35] matanya: I would just fetch origin; rebase origin
[14:14:10] done on 01
[14:14:25] andrewbogott: 02 doesn't have /var/lib/git :/
[14:14:45] matanya: ok, look at /etc/puppet.conf. Probably it was using 01 as its puppetmaster
[14:14:49] in which case you're already done :)
[14:15:17] um… or /etc/puppet/puppet.conf? I can't remember which
[14:15:58] the latter
[14:16:01] server = virt1000.wikimedia.org
[14:16:08] oh, then...
[14:16:16] then puppet runs are just broken :) I'll look
[14:16:32] yes, they are broken on this host
[14:16:48] Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'reason not specified');
[14:16:52] oh, it's disabled!
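The "administratively disabled" notice above is what the agent prints after someone has run puppet agent --disable on the host. A minimal sketch of re-enabling it — whether doing so is safe here is exactly the question deferred to _joe_ below:

    # re-enable the agent and do an attended run
    sudo puppet agent --enable
    sudo puppet agent --test --verbose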
[14:17:03] So this really is a problem for _joe_
[14:17:17] i think you should talk to _joe_ before moving forward
[14:18:35] matanya: do you by chance a) know a bit about DNS and b) have time to help me with some testing?
[14:19:24] andrewbogott: a) i have just refactored my company's dns over the last two weeks, so yes, bind to death :) b) in 5 minutes or so
[14:20:29] matanya: ok, five minutes isn't enough for the whole process, but my first question is: Can you teach me how to test the functionality of a specific dns server? Basically confirm that a lookup is working via e.g. labcontrol2001.wikimedia.org
[14:21:05] andrewbogott: in 5, means 5 minutes from now :)
[14:21:15] oh! Better yet!
[14:33:45] ok, i'm here andrewbogott
[14:34:26] matanya: well, I've already found one broken thing before I even ran a test :)
[14:34:32] But, ok, here's the big picture...
[14:34:57] right now labs-ns0 is the same box as virt0 and labs-ns1 is the same box as virt1000
[14:35:13] I need to change things such that labs-ns0 uses virt1000, and labs-ns1 is labcontrol2001
[14:35:19] Ideally without causing a dns outage :)
[14:35:44] So first I need to make sure that labcontrol2001 is actually functioning as a dns server...
[14:36:43] nslookup wikipedia.org nameofyourserver
[14:37:01] e.g. nslookup wikipedia.org labcontrol2001
[14:37:23] or nslookup wikipedia.org labs-ns0
[14:37:35] I'm going to create a temporary labs-ns2 alias...
[14:37:38] depends on what the dns server name is, of course
[14:45:37] ok, matanya, https://gerrit.wikimedia.org/r/163573 and https://gerrit.wikimedia.org/r/163574
[14:47:39] ok, this looks ok
[14:48:11] I'm not positive that that's the whole story to get pdns up, but it's a good start
[14:48:52] i know nothing about pdns, iirc you have a dns master in house named bblack, no? :D
[14:50:38] yeah, but if I ask him for help with labs he'll just say "Don't do it that way!"
[14:52:30] ok, now we have a dns server running on labcontrol2001, let's see...
[14:55:36] curious...
[14:56:03] I guess I don't know precisely what labs-ns1 and labs-ns0 are for… I can't actually get useful answers from either of them. Can you?
[14:57:32] Coren: I have a simple and dumb question: What are labs-ns0 and labs-ns1 used for? Lookups /about/ labs, or lookups from inside labs? Or something else?
[14:58:27] Both, really. dnsmasq allocates dhcp and then stuffs the lease in its "zone file"-equivalent. They are also recursors for labs proper (but they have no /need/ to be)
[14:59:04] so, what's an example of a test that I can run to verify that labs-ns1 is up and running and talking to the ldap backend?
[15:00:26] Well, I guess the answer is "nslookup tools-login.wmflabs.org labs-ns1.wikimedia.org"
[15:00:30] and also labs-ns1 is down :(
[15:02:50] Ah, there we go, it's back.
[15:08:14] matanya: what port should I be checking in my firewall? I can't get any dns service from labcontrol2001 although the server seems to be running
[15:17:43] matanya and/or Coren, can you look at manifests/role/dns.pp? I suspect that the ferm rules in role::dns::recursor need to also be present in role::dns::ldap -- does that seem right to you?
[15:17:56] And, all of them, or just the ones for port 53?
[15:17:57] * Coren looks.
[15:18:59] Well, they're all for port 53. :-)
[15:19:08] andrewbogott: 53 for sure
[15:19:12] And yes, I expect you'd want those rules.
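DNS needs both UDP and TCP on port 53, so once the ferm rules discussed above are applied they should be visible in the running ruleset. A sketch of the check (the grep pattern is just an illustration):

    # on labcontrol2001: confirm port 53 is accepted
    sudo iptables -L INPUT -n | grep ':53'
    # then confirm the daemon actually answers a query from outside
    nslookup tools-login.wmflabs.org labcontrol2001.wikimedia.org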
[15:20:26] https://gerrit.wikimedia.org/r/#/c/163581/
[15:20:56] for the record: I have no idea what 'Skip DNS outgoing connection tracking' means
[15:24:38] Hi everybody, I have a question
[15:24:50] is there any online tool to show "all pages created by a user in Descending Order of their traffic statistics"?
[15:25:17] Wikimedia Labs / Infrastructure: Unable to connect to redis server - https://bugzilla.wikimedia.org/71415 (Amir E. Aharoni) NEW p:Unprio s:normal a:None I tried to edit http://es.wikipedia.beta.wmflabs.org/w/index.php?title=P%C3%A1gina_principal , and got an error: ========================...
[15:26:30] Wikimedia Labs / Infrastructure: Unable to connect to redis server - https://bugzilla.wikimedia.org/71415#c1 (Chris McMahon) Created attachment 16622 --> https://bugzilla.wikimedia.org/attachment.cgi?id=16622&action=edit redis error
[15:26:59] Wikimedia Labs / Infrastructure: Unable to connect to redis server - https://bugzilla.wikimedia.org/71415#c2 (Chris McMahon) p:Unprio>Highest s:normal>major see screenshot from Flow
[15:29:45] Wikimedia Labs / deployment-prep (beta): Unable to connect to redis server - https://bugzilla.wikimedia.org/71415 (Chris McMahon)
[15:44:53] andrewbogott: for general knowledge: https://gerrit.wikimedia.org/r/#/c/134071/
[15:46:54] matanya: thanks!
[16:26:49] andrewbogott: wondering, is it possible to run X on labs ?
[16:30:16] matanya: yes; xvfb
[16:30:24] thanks
[16:31:30] Wikimedia Labs / deployment-prep (beta): Unable to connect to redis server - https://bugzilla.wikimedia.org/71415 (Ori Livneh) PATC>RESO/FIX
[16:35:41] rules set in ferm should appear in iptables --list, shouldn't they?
[16:40:03] yes, they should
[16:40:31] Hm, 53 is still not open on labcontrol2001
[16:46:00] weird, the ferm::rule class didn't actually install ferm. so I added base::firewall to my server, and that slammed shut a bunch of ports that were open before, and also seems to be ignoring the explicit ferm rules I set
[16:54:34] I've actually never had cause to use the ferm rules before, so I don't think I could tell you how they are supposed to behave except that "you seem to have done the same thing"
[17:33:08] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (10.00%) WARN: tools.tools-exec-08.diskspace._var.byte_avail.value (100.00%)
[17:33:28] chrismcmahon: beta labs seems to be down
[17:33:55] kaldari|2: looking. Or i broke it earlier today
[17:34:51] chrismcmahon: well, it's half back up now, just no CSS or JS.
[17:35:23] kaldari|2: hmm, wfm, I can log in and edit, I'm seeing correct pages
[17:35:31] kaldari|2: file a bug for it?
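Expanding on the xvfb answer at 16:30 above: X clients can run on a headless labs instance under a virtual framebuffer. A minimal sketch — "some-x-program" is a placeholder, not a real tool:

    sudo apt-get install xvfb
    # xvfb-run allocates a virtual display, runs the command, and tears the display down
    xvfb-run -a some-x-program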
[17:36:11] chrismcmahon: I get a ton of 404 errors
[17:36:53] chrismcmahon: http://bits.beta.wmflabs.org/en.wikipedia.beta.wmflabs.org/load.php?debug=false&lang=en&modules=ext.echo.badge|ext.flaggedRevs.basic|ext.uls.nojs|ext.visualEditor.viewPageTarget.noscript|ext.wikihiero%2CwikimediaBadges|mediawiki.legacy.commonPrint%2Cshared|mediawiki.skinning.interface|mediawiki.ui.button|skins.vector.styles|wikibase.client.init&only=styles&skin=vector&*
[17:42:31] Wikimedia Labs / deployment-prep (beta): Unable to connect to redis server - https://bugzilla.wikimedia.org/71415#c5 (Chris McMahon) RESO/FIX>REOP I'm still seeing errors about redis when trying to save an edit
[17:44:08] there's some dns issue
[17:44:10] see -operations
[17:45:08] ^ labs-ns1 is probably giving out bad answers to the internet
[17:48:40] * matanya points @ andrewbogott
[17:49:28] I haven't actually done anything yet...
[17:49:29] what's broken?
[17:49:53] see -operations! :)
[17:51:12] * matanya takes the finger back
[17:51:16] short summary: labs-ns1 is giving out bad answers (like can't resolve bastion.wmflabs.org) to internet DNS caches. I stopped powerdns on virt1000 (aka labs-ns1) and disabled puppet so that we don't continue polluting.
[17:51:36] labs-ns0 (virt0) seems to be giving more-correct answers
[17:51:51] so it's not providing incorrect IPs, just failing?
[17:52:10] it gave a NOERROR response in that case, which means it thinks it has some data for that hostname, just not an A-record
[17:52:39] (which is a negative response to the requestor: it's saying there's no A-record on which that hostname is reachable)
[17:52:44] most likely it just needed a restart… pdns gets upset when ldap restarts.
[17:52:52] bblack: if I restart it will you help me verify whether or not it's still broken?
[17:53:07] I'm still there, I can try it and verify myself at the same time
[17:53:29] it was running on virt1000 already...
[17:53:44] it's working fine now
[17:53:50] …and still?
[17:53:55] I started it
[17:55:43] Sep 29 15:03:06 virt1000 pdns[8451]: Error sending reply with sendto (socket=5): Invalid argument
[17:55:46] Sep 29 15:03:06 virt1000 pdns[8451]: Error sending reply with sendto (socket=6): Invalid argument
[17:55:52] those seem to be "normal" there, but sure seems fishy
[17:56:27] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (11.11%) WARN: tools.tools-exec-08.diskspace._var.byte_avail.value (100.00%)
[17:56:30] andrewbogott: i have disk space issues on my vm, any other solution other than a new machine / deleting files ?
[17:56:42] great timing
[17:56:50] matanya: lvm?
[17:57:08] labstore.svc.eqiad.wmnet:/project/video/home
[17:57:23] !log deployment-prep updated OCG to version 89d8f29a24295b05d0643abe976fea83b56575c9
[17:57:26] this ldap restart -> powerdns gives caches around the world bad responses seems like a pretty ugly situation
[17:57:26] Logged the message, Master
[17:57:41] it is encoding01.eqiad.wmflabs
[17:57:52] i'm encoding very large files
[17:58:03] and they drink space like water
[17:58:09] matanya: What I mean is: Do you know about the puppet lvm classes?
[17:58:12] To manage partitions on a labs box?
[17:58:16] we should probably at least add an icinga check for it (that queries labs-ns1 for some wmflabs.org hostname that comes from LDAP)
[17:58:26] I do, andrewbogott
[17:58:42] well, I guess we do already, just it was broken and nobody cared
[17:58:56] matanya: instances cannot be resized. So, you can repartition, or use /data/project, or build a new bigger instance
[17:59:18] ok, thanks, i'll add new images.
[18:00:05] oh, actually, the problem is more insidious than that
[18:00:12] i should puppetize this setup, it's getting complicated. Thanks for the help andrewbogott
[18:00:17] the nagios check says "OK" if it gets a valid response that contains no addresses
[18:06:51] andrewbogott: thanks for fixing tool labs, or whoever fixed it
[18:08:04] kaldari: next time feel free to ping me when the problem actually starts :)
[18:10:09] will do
[18:24:55] andrewbogott: still can't ssh to tools-login though: "ssh: Could not resolve hostname tools-login.wmflabs.org: nodename nor servname provided, or not known"
[18:25:10] kaldari: what was fixed then?
[18:25:41] andrewbogott: the rest of tool labs, i.e. the webserver being accessible
[18:25:57] kaldari: have an example url?
[18:26:10] http://tools.wmflabs.org/
[18:26:12] tools-login.wmflabs.org resolves for me, I would've thought they'd be coupled for you
[18:27:31] kaldari: what does nslookup tools-login.wmflabs.org tell you?
[18:28:25] nslookup tools-login.wmflabs.org
[18:28:25] Server: 192.168.39.5
[18:28:25] Address: 192.168.39.5#53
[18:28:25]
[18:28:26] Non-authoritative answer:
[18:28:26] *** Can't find tools-login.wmflabs.org: No answer
[18:29:08] hm, 192.168.39.5 must be your local router or something, huh?
[18:47:54] aude: Are the wikidata instances mostly yours these days? There are quite a lot of wikidata-* instances in this list: https://wikitech.wikimedia.org/wiki/Ldap_rename
[18:49:42] andrewbogott: I tried some other DNS servers but they all return "Can't find tools-login.wmflabs.org: No answer"
[18:50:02] kaldari: google too?
[18:50:08] yeah
[18:50:46] hm
[18:52:16] andrewbogott: Yup, dns is broken (again)
[18:52:18] kaldari: same problem for bastion.wmflabs.org right?
[18:52:25] andrewbogott: I actually do need to log into it to check the experimental WikiGrok data. Can you just give me the IP address in the meantime :)
[18:52:39] tools-login is 208.80.155.130
[18:52:43] thanks
[18:52:48] kaldari: Check your ssh history :P
[18:53:04] ssh has history?
[18:53:14] known_hosts, unless you encrypt it
[18:54:36] yeah it's the same problem (re: DNS)
[18:55:23] bblack: It works for me, and tools.wmflabs.org resolves on downforeveryoneorjustme
[18:55:28] what's with that?
[18:56:15] well the thing to check is directly against labs-ns[01] - that hostname works for me right now at both, but if it's an intermittent problem....
[18:56:28] it could also be from the exact incident earlier, as caches would remember that for an hour
[18:56:50] (kaldari's first mention was still within that hour window from the earlier problem)
[18:57:12] I suspect it's the cache
[18:58:29] well, the cache remembering us doing something wrong, not the cache itself being faulty
[18:58:46] Sure, that's what I meant...
[18:59:01] But… there's nothing we can do about that, is there?
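The "check your ssh history" tip at 18:52 above works because OpenSSH records every host it connects to in ~/.ssh/known_hosts. A sketch, assuming HashKnownHosts is off (the "unless you encrypt it" caveat):

    # plain entries are stored as "hostname,IP key", so this reveals the address
    grep tools-login ~/.ssh/known_hosts
    # with hashed entries, this at least confirms the host is known
    ssh-keygen -F tools-login.wmflabs.org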
[18:59:03] but relatedly (and this affects prod DNS monitoring as well), I'm gonna see if I can fix the authdns icinga check
[18:59:20] because right now it says "OK" when our authservers return a response with no addresses in it (which is what happened in this LDAP scenario)
[18:59:25] so we don't even get notified about it
[18:59:28] yeah, that'd be good -- presuming that we aren't in the middle of an outage that we should be fixing
[18:59:51] there's nothing we can do about the caching from the previous event, but the hour should be up roughly around now
[19:08:32] kaldari: how's your dns? Any better?
[19:22:42] andrewbogott: so we have to update puppet?
[19:22:50] aude: yes please
[19:22:57] ok
[19:23:02] aude: For the full story, look for the IMPORTANT thread on labs-l
[19:24:31] ok
[19:24:43] * aude going to update them now
[19:24:47] or else i might forget
[19:27:06] thanks
[19:27:29] andrewbogott: seems there is a syntax error in /etc/puppet/manifests/role/phabricator.pp:89
[19:27:44] we didn't touch that
[19:28:25] aude: merge conflict maybe?
[19:29:49] no
[19:30:33] i am on wdjenkins
[19:32:13] trying another instance
[19:32:22] same issue :(
[19:33:28] aude: are you unable to rebase those instances?
[19:33:43] That parse error could be due to a misplaced bracket pretty much anywhere
[19:34:17] git pull --rebase origin master
[19:34:21] no issues
[19:34:35] aude: I would assume an edit I made? but I thought I had the jenkins ok
[19:34:59] why is 'git diff origin' so enormous then?
[19:35:47] also I can't rebase because it seems that there is already a rebase-apply directory
[19:35:55] which makes it seem like your rebase didn't go so well?
[19:35:56] trying again
[19:36:01] tools-login.wmflabs.org unknown host?
[19:36:05] that's why
[19:37:51] Krinkle: we had a brief DNS outage, perhaps you still have an empty cache. I'd expect it to be fixed by now though
[19:38:16] I flushed local dns on my computer
[19:38:18] still not working
[19:38:21] I'm in the SF office
[19:39:01] James and Roan couldn't connect either just now
[19:40:43] Coren: can you take a moment to sort out what's happening with DNS in the office? I can't see the issue here at all.
[19:40:49] or bblack?
[19:43:30] Krinkle: do you have a linux/mac machine there in the office?
[19:43:36] Mac
[19:44:06] Krinkle: can you try first just "dig tools-login.wmflabs.org" on the cmdline on the mac?
[19:44:26] I myself ran flushdns ('sudo killall -HUP mDNSResponder') on my computer and still got no resolution. James and Roan tried it without manually flushing DNS first.
[19:44:50] wmflabs.org. 125 IN SOA virt1000.wikimedia.org. hostmaster.wikimedia.org. 1411717753 1800 3600 86400 7200
[19:44:50] ;; Query time: 8 msec
[19:44:50] ;; SERVER: 192.168.39.5#53(192.168.39.5)
[19:45:12] so it's got 125s left in your local office cache on that bad lookup
[19:45:13] ;tools-login.wmflabs.org. IN A
[19:45:33] keep checking, the number should go down. what will be interesting will be what happens when it reaches zero
[19:45:54] aude: sorted now?
[19:47:19] Krinkle: what does it say now?
[19:47:28] andrewbogott: i think git is updated correctly now
[19:47:37] but still get the error
[19:47:40] wmflabs.org. 73040 IN NS labs-ns0.wikimedia.org.
[19:47:40] wmflabs.org. 73040 IN NS labs-ns1.wikimedia.org.
[19:47:44] PING tools-login.wmflabs.org (208.80.155.130): 56 data bytes
[19:47:57] so dig got the address?
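To separate a stale cache from a live server fault, the query can go to the labs authoritative servers directly instead of the office resolver — a minimal sketch:

    # bypass all intermediate caches
    dig @labs-ns0.wikimedia.org tools-login.wmflabs.org A +short
    dig @labs-ns1.wikimedia.org tools-login.wmflabs.org A +short
    # an empty answer with status NOERROR is the bad negative response
    # described earlier, and resolvers will cache it for the record's TTL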
[19:47:59] aude: ok, I will look again shortly
[19:48:03] tabs paste horribly :)
[19:48:24] http://pastie.org/9605336
[19:48:34] wikidata-builder3
[19:48:44] it's possible there's some badly-configured chained caching going on that ignores TTLs and thus extends them a bit
[19:48:55] see if anyone else in the office has problems hitting it from here forward
[19:52:23] chasemp: What have you been doing about passwords::mysql::phabricator on labs?
[19:52:53] nothing really, it's hardcoded in the manifest in labs
[19:52:59] so variables not used I guess?
[19:53:06] but the puppet manifests don't work on labs then?
[19:53:11] yuvi actually used my labs logic so we should ask him :)
[19:53:12] they do
[19:53:15] aude is seeing one error out, I presume because passwords are undefined
[19:53:20] there is a phabricator::main and phabricator::labs
[19:55:27] (PS1) Andrew Bogott: Insert a bunch of dummy passwords for phabricator [labs/private] - https://gerrit.wikimedia.org/r/163661
[19:55:47] chasemp: to quiet puppet: ^
[19:56:34] andrewbogott: what's the context of the error people are getting? are they trying to apply the phab::main class or?
[19:56:43] not opposed to a labs stub out for the passwords module
[19:56:48] but may be doing the wrong thing to begin w/
[19:56:51] chasemp: dunno. aude?
[19:57:13] trying to apply
[19:57:20] puppet agent -tv
[19:57:30] apply what?
[19:58:56] http://pastie.org/9605379
[19:59:44] Wikimedia Labs / deployment-prep (beta): Unable to connect to redis server - https://bugzilla.wikimedia.org/71415#c6 (Chris McMahon) for example, clicking Preferences: [fe58754d] /wiki/Special:Preferences Exception from line 827 of /srv/mediawiki/php-master/includes/jobqueue/JobQueueRedis.php: Unable...
[20:02:33] aude: are you trying to get phabricator set up then?
[20:02:45] I think you are applying the wrong class, the class for prod is going to have things that can't work in labs at all
[20:03:25] or at one point an error like that did exist I think but is old, could this be an old master?
[20:03:39] but again should be for phab::main only and not phab::labs?
[20:03:48] no, not setting it up
[20:04:09] only want to be able to run puppet
[20:04:31] chasemp: it probably has to parse the file even if it isn't included...
[20:04:40] yep
[20:05:14] (CR) Rush: [C: 1] "sure" [labs/private] - https://gerrit.wikimedia.org/r/163661 (owner: Andrew Bogott)
[20:05:22] give that a whirl then ^
[20:05:28] otherwise not sure what's up
[20:06:06] (CR) Andrew Bogott: [C: 2] Insert a bunch of dummy passwords for phabricator [labs/private] - https://gerrit.wikimedia.org/r/163661 (owner: Andrew Bogott)
[20:06:24] aude: since you are on a local puppetmaster you will need to rebase your private repo as well.
[20:06:37] it is in /var/lib/git/labs
[20:07:04] ok
[20:10:23] i get Permission denied (publickey).
[20:10:38] is there some specific way to rebase that?
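For chasing the phabricator.pp syntax error reported above, puppet can validate a single manifest without a full agent run — a sketch using the path from the error message:

    sudo puppet parser validate /etc/puppet/manifests/role/phabricator.pp
    # prints nothing on success; on failure it reports the file and line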
[20:10:51] hm, let me try
[20:11:08] ok
[20:11:19] (CR) Andrew Bogott: [V: 2] Insert a bunch of dummy passwords for phabricator [labs/private] - https://gerrit.wikimedia.org/r/163661 (owner: Andrew Bogott)
[20:11:52] aude: "sudo GIT_SSH=/var/lib/git/ssh git pull --rebase"
[20:12:03] as per https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster#FAQ
[20:13:16] ok
[20:14:13] and still get the error
[20:14:18] suppose we need to wait
[20:16:36] (PS1) Andrew Bogott: Revert "Insert a bunch of dummy passwords for phabricator" [labs/private] - https://gerrit.wikimedia.org/r/163667
[20:16:47] (CR) Andrew Bogott: [C: 2 V: 2] Revert "Insert a bunch of dummy passwords for phabricator" [labs/private] - https://gerrit.wikimedia.org/r/163667 (owner: Andrew Bogott)
[20:19:23] hey ori are you the only one looking at redis on beta labs right now? it still has problems, e.g. errors when clicking Preferences
[20:21:20] aude: wait for what?
[20:21:51] idk
[20:22:03] i don't know what else to try
[20:22:42] aude: if I clip out the file that's throwing the error it just inserts a syntax error someplace else.
[20:22:50] Can you explain to me why that instance uses self-hosted puppet in the first place?
[20:23:30] https://bugzilla.wikimedia.org/show_bug.cgi?id=71419 for one of them
[20:24:55] except the puppetmaster there is == to the upstream, isn't it?
[20:25:43] it might be better for jan to work on this
[20:28:44] Right now it includes the class wdjenkins::jenkins
[20:28:48] and I don't see that defined anywhere
[20:28:54] maybe that's some puppetception magic...
[20:29:09] I'm going to just remove that class from the instance, I bet it'll work fine after that
[20:29:18] ok
[20:30:10] aude: you can point Jan to me tomorrow if needed
[20:30:13] hm, oddly that did not help
[20:31:27] the puppet master is a bit of a mess :D
[20:32:44] it might be better for us to make new instances
[20:33:38] or jan might know what to do
[20:37:32] Is beta getting updated regularly again?
[20:41:14] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (11.11%) WARN: tools.tools-exec-08.diskspace._var.byte_avail.value (100.00%)
[20:47:47] aude: that fixed the syntax issue I guess?
[20:49:39] chasemp: no
[21:15:39] How do I make a partition bigger on labs?
[21:15:58] /dev/mapper/vd-second--local--disk 9.1G 8.6G 36M 100% /srv
[21:16:36] partition/disk
[21:20:08] "Allocate all of the instance's extra space as /srv"
[21:20:25] Disk /dev/vda: 21.5 GB, 21474836480 bytes
[21:20:31] /dev/vda1 8.1G 2.2G 5.5G 29% /
[21:20:36] /dev/vda2 2.1G 696M 1.3G 37% /var
[21:20:40] so it's "all the rest"
[21:22:15] !log beta deployment-rsync01 hard drive is far too small
[21:22:15] beta is not a valid project.
[21:22:23] !log deployment-prep deployment-rsync01 hard drive is far too small
[21:22:25] Logged the message, Master
[21:24:46] !log deployment-prep deleted l10n cache on deployment-rsync01 to attempt to run sync-common manually
[21:24:50] Logged the message, Master
[21:26:18] Reedy: Maybe we should just build a bigger instance to replace rsync01?
[21:26:32] Wikimedia Labs / deployment-prep (beta): deployment-rsync01 20GB hard drive is too small - https://bugzilla.wikimedia.org/71431 (Sam Reed (reedy)) NEW p:Unprio s:normal a:None Exactly what it says on the tin. It's causing auto code updates to break
[21:26:34] It *should* be easy to
[21:26:35] That's presumably the better option
[21:27:35] * Reedy looks at how big one version of mw is on tin
[21:27:59] Make a new host with a bigger disk, join it with beta puppet/salt, apply the same roles as rsync01, change the fake dsh files in the beta role to point to the new host
[21:28:26] I'm presuming 20GB is "just" too small
[21:28:31] It will be bigger in beta because it includes the full mediawiki-extensions repo
[21:28:50] 4GB of ram seems a bit OTT
[21:29:24] /dev/mapper/vd-second--local--disk 73G 12G 57G 18% /mnt
[21:29:28] Yah. Just having a bigger vdish would be fine. I wonder if that's something that andrewbogott can make happen for an existing instance?
[21:29:31] (from deployment-bastion)
[21:29:32] *vdisk
[21:29:42] and/or "another disk"
[21:34:15] !log deployment-prep disabled "beta-scap-eqiad" until things are fixed
[21:34:17] Logged the message, Master
[21:35:13] no point it running if it's gonna fail/conflict with my manual run
[21:35:29] Reedy: Better may be to just take rsync01 out of the loop. That can be done with a puppet change to make the other hosts sync with deployment-bastion.
[21:36:33] Reedy: Change this file to say "deployment-bastion.eqiad.wmflabs" -- https://github.com/wikimedia/operations-puppet/blob/production/modules/beta/files/dsh/group/scap-proxies
[21:37:11] Then cherry-pick the patch on deployment-salt and force a puppet run on deployment-bastion.eqiad.wmflabs
[21:37:25] Or just hot patch on deployment-salt as a temp
[21:37:55] * bd808 jfdi's that
[21:38:19] haha, was just making https://gerrit.wikimedia.org/r/163736
[21:38:50] bd808: I can't resize an existing instance -- you might still have available space to partition though
[21:39:12] Reedy: Cool. Play on then. You might want to pull rsync01 from the mediawiki-installation too
[21:39:19] andrewbogott: That is the remaining space :(
[21:39:35] andrewbogott: We don't, I don't think. It has the /srv secondary mount
[21:40:20] ok, probably you need to start with a new bigger instance then
[21:40:27] or use /data/project… if it isn't a database
[21:40:45] a bigger instance probably makes much more sense
[21:41:10] using /data/project will make it all suck again. Bigger instance is the right answer.
[21:41:20] Hi
[21:43:09] Hmm, after killing the l10n dir and sync-common
[21:43:09] /dev/mapper/vd-second--local--disk 9.1G 5.6G 3.0G 66% /srv
[21:43:19] Just failing on
[21:43:20] rsync: rename "/srv/mediawiki/.agent.RHmqZr" -> ".~tmp~/agent": Permission denied (13)
[21:43:20] rsync: rename "/srv/mediawiki/.wikiversions-labs.cdb.sJ0RUz" -> ".~tmp~/wikiversions-labs.cdb": Permission denied (13)
[21:44:27] * Reedy deletes .~tmp~
[21:45:02] Was that a dir on deployment-bastion?
[21:45:31] it's not in /srv/mediawiki-staging on deployment-bastion at least
[21:45:40] so I'm going with no
[21:46:02] 21:45:22 Finished rsync common (duration: 01m 08s)
[21:46:05] * Reedy waits to see what happens
[21:46:08] ok. maybe crap from a failed sync
[21:46:16] brb, finding a drink
[21:48:55] /dev/mapper/vd-second--local--disk 9.1G 6.6G 2.1G 77% /srv
[21:48:58] * Reedy tries a full beta scap
[21:59:23] Hm… do we have a chjohnson in the house?
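bd808's hot-patch suggestion above, sketched. The puppet tree path follows the self-hosted-master convention mentioned earlier in the log, and the sed pattern assumes the file currently lists deployment-rsync01:

    # on deployment-salt (the beta puppetmaster), point the scap proxy
    # group at deployment-bastion instead of rsync01:
    sudo sed -i 's/^deployment-rsync01.*/deployment-bastion.eqiad.wmflabs/' \
        /var/lib/git/operations/puppet/modules/beta/files/dsh/group/scap-proxies
    # then force a puppet run on deployment-bastion to pick it up:
    sudo puppet agent -tv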
[22:11:41] 22:00:52 Started sync-proxies
[22:11:41] sync-proxies: 100% (ok: 1; fail: 0; left: 0)
[22:11:41] 22:02:54 Finished sync-proxies (duration: 02m 01s)
[22:11:42] yay
[22:11:47] andrewbogott: I think he's out today
[22:11:55] Reedy: ok, I emailed
[22:12:10] And tomorrow
[22:12:28] this is christopher.johnson@wikimedia.de?
[22:12:55] He has a bunch of instances (in the 'scrumbugz' project) with broken puppet.
[22:13:19] oh, wait
[22:13:29] I presumed you typoed on WMF's chris
[22:13:43] Either way, he's possibly not here either due to the time
[22:15:20] Not sure which of the wmf chrises are meant ... :-) but for scrumbugz wmde feels right.
[22:15:34] I brought it up with them today.
[22:16:03] They told me that he is not in the office today, but tomorrow, and will fix issues around ldap then.
[22:17:34] qchris: ah, I see. OK, thank you
[22:22:32] !log Wikidata-build updated wikidata-builder3.eqiad.wmflabs to current operations/puppet.git and apt-get see https://bugzilla.wikimedia.org/show_bug.cgi?id=71411#c4
[22:22:33] Wikidata-build is not a valid project.
[22:23:03] * jzerebecki slaps labs-morebots
[22:23:11] !log wikidata-build updated wikidata-builder3.eqiad.wmflabs to current operations/puppet.git and apt-get see https://bugzilla.wikimedia.org/show_bug.cgi?id=71411#c4
[22:23:14] Logged the message, Master
[22:23:41] good bot!
[22:27:19] chasemp: you still have a couple of doomed instances on https://wikitech.wikimedia.org/wiki/Ldap_rename#Instances_in_Danger
[22:43:02] bd808: 22:28:23 Finished scap: (no message) (duration: 38m 54s)
[22:43:15] People complain production scap is slow when it does more versions to more servers in less time? :)
[22:45:46] !log deployment-prep re-enabled beta-scap-eqiad
[22:45:48] Logged the message, Master
[23:09:48] CUSTOM - Ok, mutante