[00:06:27] 10Tool-Labs-tools-Other: Fix tool kmlexport - https://phabricator.wikimedia.org/T92963#1339805 (10yuvipanda) I think this is clearly a problem with the tool itself - and the maintainers should look into it and see what they can do. [00:22:34] cluestuff@tools-login [00:22:34] Unrouteable address [00:22:43] a930913: !!! [00:22:51] a930913: why am I being told that you did not edit?? [00:22:59] legoktm: I was testing. [00:23:10] okay [00:23:18] I still have no idea how to reproduce that error though. [00:23:27] That was the closest I got. [00:23:48] Then I gave up, and searched for code I had already written. [00:24:28] legoktm: Also, gmail moved that to my spam. [00:24:38] What spamfilter do you have? :p [00:25:10] legoktm: On the plus side, I should now get daily spam when a bot of mine goes down. [00:25:19] :D [00:25:40] a930913: I whitelisted wmflabs.org / wikimedia.org and a few other addresses [00:25:55] a930913: also, have you updated your bots for the query-continue change? [00:26:12] LALALALALA [00:26:16] Can't hear you. [00:27:00] I will fix it when it breaks something. [00:27:03] lol [00:27:06] .......alright [00:27:20] I don't think that concerns me though. [00:27:35] what are your bots usernames? I can check [00:27:44] legoktm: Does pywikibot spam warns regardless? [00:27:49] I wasn't on the list. [00:27:55] CBNG was though :/ [00:28:13] a930913: list was only top 100. if you have a unique user agent you can use https://en.wikipedia.org/wiki/Special:ApiFeatureUsage [00:28:53] legoktm: Pywikibot uses the script name for the agent, right? [00:28:57] yes [00:29:15] if you're using a reasonably modern version [00:30:09] legoktm: Any particular format? [00:30:19] I think it's regex? let me check [00:30:56] 10Tool-Labs-tools-Other: Fix tool kmlexport - https://phabricator.wikimedia.org/T92963#1339884 (10Teslaton) @yuvipanda: who actually is a maintainer of the tool? 
I know of three people who may have contributed to it in the past: https://en.wikipedia.org/wiki/User:Para https://de.wikipedia.org/wiki/Benutzer:Kolo... [00:31:30] a930913: https://www.mediawiki.org/wiki/Extension:ApiFeatureUsage isn't clear. I guess it's substring matching? [00:31:46] a930913: I have access to the raw data if you want me to run a grep [00:32:43] legoktm: wikidiffnotedit.py is BracketBot. [00:34:02] legoktm: refbot.py is ReferenceBot. [00:34:54] a930913: ummmmmmm, are you using compat??? [00:35:13] 2015-06-04 11:52:00 mw1200 enwiki api-feature-usage INFO: "action=query&!rawcontinue&!continue" "BracketBot" "IP" "" "pywikipedia-wikidiffnotedit.py/r11590 Pywikipediabot/1.0" {"private":true} [00:35:17] a930913: I see a lot of ^ [00:35:27] legoktm: I am using whatever I used at the time. [00:35:32] Each bot is different. [00:36:17] 10Tool-Labs, 10Incident-20150602-gridengine-dns-failure, 3Labs-Sprint-100, 5Patch-For-Review: Tools: puppetize the alias_hosts workaround for mismatching DNS node names - https://phabricator.wikimedia.org/T101296#1339898 (10coren) 5Open>3Resolved [00:36:57] legoktm: Why can't I see that? https://en.wikipedia.org/wiki/Special:ApiFeatureUsage?wpagent=pywikipedia-wikidiffnotedit.py%2Fr11590+Pywikipediabot%2F1.0&wpdates=2015-05-05&wpdates-end=2015-06-05 [00:37:03] I don't know [00:37:08] legoktm@fluorine:/a/mw-log$ grep -c "BracketBot" api-feature-usage.log [00:37:09] 48 [00:37:37] that's less than 24h [00:38:23] 10Tool-Labs, 3Labs-Sprint-100: setup host-based auth for tools hosts properly - https://phabricator.wikimedia.org/T98714#1339902 (10coren) FWIW, it would have been okay to allow HBA to infrastructure hosts as well given their access config demands membership in the $projectname.admin group. [00:38:57] If I recall, unless wikipedia gets more than 500 edits in some period of time, it won't need to continue.
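The `!rawcontinue&!continue` warnings being grepped above come from MediaWiki's API continuation change: requests that send neither `continue` nor `rawcontinue` get logged to api-feature-usage. A minimal sketch of the new-style continuation loop, assuming a plain `action=query` client rather than pywikibot's internals (the endpoint constant and the injectable `fetch` helper are illustrative):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"  # illustrative endpoint

def fetch(params):
    """Fetch one api.php response as a dict (network helper)."""
    with urlopen(API + "?" + urlencode(params)) as resp:
        return json.load(resp)

def query_pages(params, fetch=fetch):
    """Yield each page of an action=query result, opting in to
    new-style continuation by sending an empty 'continue' parameter
    and feeding the returned 'continue' block back into the request."""
    params = dict(params, action="query", format="json")
    params["continue"] = ""  # omit this AND rawcontinue -> deprecation warning
    while True:
        data = fetch(params)
        yield data
        if "continue" not in data:
            return
        params.update(data["continue"])
```

Old-style clients keep the previous behaviour by sending `rawcontinue=` instead; sending neither is what produces the log lines quoted above.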
[00:39:26] 10Tool-Labs, 10Incident-20150602-gridengine-dns-failure, 3Labs-Sprint-100: Tools: puppetize the alias_hosts workaround for mismatching DNS node names - https://phabricator.wikimedia.org/T101296#1339905 (10coren) [00:41:36] Sorry, it will never continue, just miss the edits :D [00:41:49] legoktm: Anyway, should I be worried? [00:41:56] I have no idea [00:42:20] a930913: I think you should be worried about CBNG [00:43:07] legoktm: More worrying as DamianZ isn't on. [00:43:20] When is doomsday? [00:43:27] I might make a countdown clock. [00:45:52] Oh, July 2. [00:46:13] legoktm: I thought it was June 2 and had already happened until you came along. [00:46:53] lol [01:36:20] 6Labs, 10Labs-Infrastructure, 6operations, 10wikitech.wikimedia.org, and 2 others: Enable IPv6 on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T73218#1340069 (10Dzahn) a:3Dzahn [01:38:33] 6Labs, 10Labs-Infrastructure, 6operations, 10wikitech.wikimedia.org, and 2 others: Enable IPv6 on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T73218#1340082 (10Dzahn) pre-up /sbin/ip token set ::208:80:154:136 dev eth0 + up ip addr add 2620:0:861:2:208:80:154:136/64 dev eth0 Notice: /... [01:49:53] 6Labs, 10Labs-Infrastructure, 6operations, 10wikitech.wikimedia.org, and 2 others: Enable IPv6 on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T73218#1340118 (10Dzahn) silver.wikimedia.org has address 208.80.154.136 silver.wikimedia.org has IPv6 address 2620:0:861:2:208:80:154:136 --- ;; ANS... [01:52:51] 6Labs, 10Continuous-Integration-Infrastructure, 10Labs-Infrastructure, 6operations: dnsmasq returns SERVFAIL for (some?) names that do not exist instead of NXDOMAIN - https://phabricator.wikimedia.org/T92351#1340120 (10scfc) 5Resolved>3declined (AFAIUI, the underlying issue has not been researched or r... 
[02:26:49] 6Labs, 10Labs-Infrastructure, 6operations, 10wikitech.wikimedia.org, and 2 others: Enable IPv6 on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T73218#1340174 (10Dzahn) silver has an IPv6 address on eth0 now and also the AAAA record in DNS. that would have resolved it if wikitech was a CNAME... [06:42:47] PROBLEM - Puppet failure on tools-trusty is CRITICAL 60.00% of data above the critical threshold [0.0] [07:07:44] RECOVERY - Puppet failure on tools-trusty is OK Less than 1.00% above the threshold [0.0] [08:13:31] PROBLEM - Puppet failure on tools-exec-1215 is CRITICAL 20.00% of data above the critical threshold [0.0] [08:14:31] PROBLEM - Puppet failure on tools-exec-1410 is CRITICAL 40.00% of data above the critical threshold [0.0] [08:14:33] PROBLEM - Puppet failure on tools-exec-1409 is CRITICAL 30.00% of data above the critical threshold [0.0] [08:16:50] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1203 is CRITICAL 60.00% of data above the critical threshold [0.0] [08:41:49] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1203 is OK Less than 1.00% above the threshold [0.0] [08:43:31] RECOVERY - Puppet failure on tools-exec-1215 is OK Less than 1.00% above the threshold [0.0] [08:44:33] RECOVERY - Puppet failure on tools-exec-1410 is OK Less than 1.00% above the threshold [0.0] [08:44:35] RECOVERY - Puppet failure on tools-exec-1409 is OK Less than 1.00% above the threshold [0.0] [10:06:00] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1340642 (10Magnus) Still can't start service: magnus@wdq-mm-02:~/wikidataquery$ sudo service wdq-mm start start: Job failed to start Maybe something is not installed (mysql client lib, etc.)? 
[10:52:24] (03PS1) 10Sitic: Use custom i18n message interpolation [labs/tools/crosswatch] - 10https://gerrit.wikimedia.org/r/216063 (https://phabricator.wikimedia.org/T101438) [10:52:52] (03CR) 10Sitic: [C: 032 V: 032] Use custom i18n message interpolation [labs/tools/crosswatch] - 10https://gerrit.wikimedia.org/r/216063 (https://phabricator.wikimedia.org/T101438) (owner: 10Sitic) [12:29:28] 10Tool-Labs, 10Continuous-Integration-Config: Set up lint checks for labs/toollabs - https://phabricator.wikimedia.org/T65687#1340931 (10hashar) a:5scfc>3None [12:37:15] (03PS1) 10Sitic: Expire main task after 60 seconds [labs/tools/crosswatch] - 10https://gerrit.wikimedia.org/r/216072 [12:37:31] (03CR) 10Sitic: [C: 032 V: 032] Expire main task after 60 seconds [labs/tools/crosswatch] - 10https://gerrit.wikimedia.org/r/216072 (owner: 10Sitic) [13:17:10] Coren, YuviPanda: Was there some labs redis "rollback" this morning? My bot records last-run timestamps in redis for https://tools.wmflabs.org/anomiebot/, and this morning I'm seeing some tasks with old timestamps but the bot's logs indicate the tasks did get run on schedule. [13:17:40] anomie: we restarted redis to deal with a security patch, but nothing outside of that. [13:17:49] anomie: I can take a look and verify, moment [13:18:22] bah, it *has* been fucked up a bit [13:18:42] Going by my logged timestamps, it seems like changes between around 2015-06-04 15:11:52 and 2015-06-05 12:00:59 UTC might have gotten lost [13:19:51] YuviPanda: master/slave sync issue? Could the clients have flipped to the slave or is this not automated yet? [13:19:51] I don't know if they're lost, but redis is certainly in an inconsistent state now [13:20:15] Coren: nope, look at /etc/hosts. it's pointing to nothing, so people are hitting tools-redis.eqiad.wmflabs when they should be hitting tools-redis-01 [13:20:21] ipresolve failure, perhaps. 
looking [13:22:31] anomie: can you connect to tools-redis-01 and see what the state of data in that is like? [13:23:01] Sure, just a minute [13:24:22] YuviPanda: That seems even worse, last timestamp logged there is 2015-06-04 15:41:38 UTC [13:26:10] anomie: right, so I think we're kind of screwed and lost data there :( [13:26:31] anomie: tools-redis should've been readonly and rejecting any data since forever, but that doesn't seem to have happened, so it's super inconsistent now [13:27:03] anomie: puppet is rummaging through moving everything to tools-redis-01 now (again). 10.68.16.18 is the old tools-redis, and you can copy data over if you want. [13:27:06] sorry! [13:27:07] YuviPanda: I note that https://wikitech.wikimedia.org/wiki/Help:Tool_Labs#Redis still says tools-redis is the one to connect to, and if there was an announcement to change to something else I don't remember seeing it [13:27:21] anomie: there's change there [13:27:38] anomie: we just made the tools-redis.eqiad.wmflabs point to tools-redis-01 - can point to tools-redis-02 if -01 goes down [13:27:56] anomie: except the 'pointing' broke, and hence you got the old *instance* named tools-redis [13:28:11] anomie: I meant there's *NO* change there [13:28:13] not there's change there [13:28:16] you still connect to tools-redis [13:28:22] So leaving my config as "server=tools-redis:6379" is fine? Good. [13:28:25] yes [13:28:35] you need to 'copy things over' only if you want to do data recovery [13:29:08] My data in redis is all caches, persistent data is in labsdb. So I'm good there. [13:29:18] eah [13:29:19] yeah [13:29:21] ah that explains the strange redis issues I saw. http://graphite.wmflabs.org/render/?width=586&height=308&target=tools.tools-redis.redis.6379.memory.internal_view&from=-2days looks a bit strange too [13:29:23] and that's how it should be! [13:29:35] sitic: yeah. I'm going to kick all the instances now, expect another hiccup [13:30:44] anomie: can you file a bug? 
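The inconsistency anomie describes — writes landing on the old tools-redis instance while tools-redis-01 stayed frozen at 2015-06-04 15:41:38 — can be characterized by diffing per-task last-run timestamps pulled from each instance. A hedged sketch (anomiebot's actual key schema isn't shown in the log; assume each instance yields a `{task: timestamp}` dict, e.g. via a redis client's HGETALL):

```python
from datetime import datetime

def diverged_tasks(primary, stale, fmt="%Y-%m-%d %H:%M:%S"):
    """Compare {task: 'YYYY-MM-DD HH:MM:SS'} dicts from the instance
    now in use (primary) and the other instance (stale).  Return tasks
    whose primary timestamp is missing or older than the stale copy,
    i.e. runs that appear to have been recorded on the wrong host."""
    lost = {}
    for task, ts in stale.items():
        cur = primary.get(task)
        if cur is None or datetime.strptime(cur, fmt) < datetime.strptime(ts, fmt):
            lost[task] = (cur, ts)
    return lost
```

Only useful for triage, of course; as noted above, data written to the wrong instance in that window is effectively lost unless copied over by hand.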
[13:31:25] YuviPanda: I can file bugs ;) But in this case I don't really know what the bug should say beyond "something went wrong with redis". [13:31:51] anomie: file it with the weird behaviour you found on your tool's redis data, and I'll expand and add details [13:33:26] should set up nutcracker properly rather than do this /etc/hosts hack [13:34:11] 10Tool-Labs: Something went wrong with redis 2015-06-04 to 2015-06-05 - https://phabricator.wikimedia.org/T101514#1341033 (10Anomie) 3NEW a:3yuvipanda [13:34:23] 10Tool-Labs: Something went wrong with redis 2015-06-04 to 2015-06-05 - https://phabricator.wikimedia.org/T101514#1341041 (10Anomie) IRC log: ``` Coren, YuviPanda: Was there some labs redis "rollback" this morning? My bot records last-run timestamps in redis for https://tools.wmflabs.org/anomiebot/, and... [13:34:34] YuviPanda: https://phabricator.wikimedia.org/T101514 [13:37:32] anomie: thanks! [13:57:54] 6Labs, 10Labs-Infrastructure, 6operations, 10wikitech.wikimedia.org, and 2 others: Enable IPv6 on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T73218#1341313 (10wpmirrordev) Let us test this from an IPv6 only network: (shell) ping6 -c 1 wikitech.wikimedia.org PING wikitech.wikimedia.org(silv... [14:19:14] so, is tools-redis read-only atm? [14:19:45] gifti: no, shouldn't be. [14:19:50] it should just connect to tools-redis-01 now? [14:20:13] i have an array that should be reduced in size over time, but it doesn't, at the same time the entries seem to be processed [14:20:38] gifti: I see your exec node is having trouble connecting to it. [14:20:40] gifti: am looking at it now [14:20:55] hm [14:20:58] thx [14:21:40] uhm.... webservice start fails [14:21:52] https://www.irccloud.com/pastebin/nSVHYIYj/webservice%20start [14:22:07] Revi: try on tools-login.wmflabs.org? [14:22:13] wait a sec.... [14:23:58] ok, tools.trusty specific error [14:26:13] Revi: yeah, tools-trusty should be deprecated. I'll announce on the mailing list.
Just use tools-login.wmflabs.org - it is trusty too now [14:26:23] kk then [14:26:41] Revi: I also left you a talk page note about moving crontabs for your bot [14:29:51] Saw it [14:30:32] When I set it up, I couldn't understand jsub things so I just didn't use it [14:32:05] and yeah, it works fine as it was [14:32:38] Revi: cool, thanks [14:32:50] gifti: jobs starting now on your exec nodes should use the right redis instance [14:32:58] the old one has been shut down [14:33:03] no problem :D thanks for converting it to jsub [14:33:06] ah [14:34:06] Revi: :) yw! [14:45:52] my cron mail is cut off because so many ge errors :\ [14:54:25] "cut off"? [14:57:10] the rest is missing [15:27:08] andrewbogott: in openstack terminology, does a 'server' map to what we call 'instance'? [15:27:22] (am reading http://docs.openstack.org/developer/python-novaclient/ref/v2/servers.html) [15:45:12] YuviPanda: For the most part, and in our case yes. [15:45:55] YuviPanda: Strictly, an "instance" in openstack is a "server" that is a VM. Openstack also supports bare metal servers that aren't instances but we don't use that feature. [15:46:55] Coren: cool. [15:46:58] The primary difference being, of course, that you can't just create servers that aren't instances out of thin air (otherwise Dell would despair). :-) [15:47:02] I'm exploring the nova REST API [15:47:09] instead of parsing the commandline [15:48:04] Yeah, for API purposes, all instances are servers and - in our current setup - all servers are instances. [15:59:33] YuviPanda: I think that must be what they mean [16:04:29] andrewbogott: cool. I looked at novastats and it feels a lot slower than using the API directly [16:04:53] It is way slower [16:05:05] If you want to rewrite it to query REST rather than use the commandline it would be better [16:05:15] but generally when I use it I’m not in a hurry [16:05:17] andrewbogott: I'm messing around with it, yeah. 
[16:05:29] andrewbogott: OS-EXT-SRV-ATTR:host is what refers to the current host, right? [16:05:38] yeah [16:05:52] andrewbogott: cool! [16:34:29] Coren: so I ran a script to detect if there are any redundant parts hosted in the same host. tools-shadow and -master are both on 1004 :( [16:34:45] * Coren headdesks. [16:34:55] At least, moving -shadow is teh trivialz [16:43:24] Coren: yup! [16:43:29] Coren: let me file a task for it. [16:43:34] we should also move them to trusty [16:49:01] Well, it's probably worthwhile to switch -shadow alone first, test it out for a while. [17:00:09] !log cvn Unscheduled reboot of cvn-app5. Bots are having trouble writing to local disk. [17:00:12] Logged the message, Master [17:23:47] Coren: andrewbogott I'm thinking I'll cold migrate tools-shadow to another instance (based off of https://wikitech.wikimedia.org/wiki/OpenStack#cold-migrate) [17:23:51] any objections? [17:24:07] YuviPanda: nope. Just remember that after a cold-migrate you have to clean up the image files on the old host. [17:25:06] YuviPanda: Yeah, I'd rather not play with the switch to trusty just before the weekend. [17:30:22] Coren: yeah, I agree. [17:30:42] andrewbogott: hmm, I don't see instructions for cleaning up image files on https://wikitech.wikimedia.org/wiki/OpenStack#cold-migrate [17:33:24] 'Note that disk image on the old host is /not/ deleted, so you may want to clean it up by hand after migration.' [17:33:29] I guess that isn’t instructions. [17:33:41] Oh, the next sentence is, though! “ Instance files live in /var/lib/nova/instances. “ [17:37:16] andrewbogott: fair enough :) let me do it [17:41:16] andrewbogott: how do I set up an agent that it *can* use? I can't source novaenv without becoming root and root can't use my ssh-agent thingy [17:42:01] Um… I do it by forwarding a root key, which is a Bad Thing [17:42:12] oh [17:42:13] hmm [17:42:24] So I guess… I encourage you to set up a better way?
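A co-hosting check like the one described above can be done by grouping the nova REST API's `/servers/detail` listing on the `OS-EXT-SRV-ATTR:host` extension attribute just discussed. A sketch over already-fetched server dicts (authentication and the HTTP call are omitted; the data shapes are as the API returns them):

```python
from collections import defaultdict

def cohosted(servers):
    """servers: list of dicts shaped like entries in the nova REST
    API's /servers/detail response, each carrying its hypervisor in
    the 'OS-EXT-SRV-ATTR:host' extension attribute.  Returns
    {hypervisor: [instance names]} for every hypervisor carrying more
    than one of the given instances."""
    by_host = defaultdict(list)
    for server in servers:
        by_host[server["OS-EXT-SRV-ATTR:host"]].append(server["name"])
    return {host: sorted(names)
            for host, names in by_host.items() if len(names) > 1}
```

Fed the redundant pairs of a project (masters and shadows, redis primaries and spares), this would flag cases like the tools-master/tools-shadow pair sharing one virt host above.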
You can certainly just make a local copy of novaenv if that helps. [17:42:45] at some point in the future, I guess [17:42:50] But, I think that there may be a better way than using cold-migrate… [17:43:04] Try live-migrate, suspend, shrink [17:43:16] nah, it doesn't matter if that host reboots so 'tis ok [17:43:19] I think you can do all that without any key-forwarding, and it should be slightly nicer anyway (cleans up, doesn’t reboot instance) [17:43:27] right [17:43:38] but friday evening 6PM, I'm just going to cold migrate :D [17:43:44] ok [17:43:48] to labvirt1002! [17:43:50] (picked randomly) [17:44:00] !log tools migrate tools-shadow to labvirt1002 [17:44:04] Logged the message, Master [17:44:10] andrewbogott: Also "sometimes fails randomly in odd ways that seems to have sometimes destroyed the host" [17:44:29] Did I say that about live migration? [17:45:16] No, those weren't quote-quotes, they were scare quotes. :-) [17:47:04] andrewbogott: hmm, I think it just failed with > Backing file not found. [17:47:27] this is also a very, very old instance [17:49:02] YuviPanda: ok, you’ll have to fish around for the backing file, possibly on the ciscos :( [17:49:12] I couldn’t tell you why the instance is working now [17:49:25] ouch [17:50:33] andrewbogott: I'm just going to file a bug and let it be then [17:50:46] ok. You could also rebuild it [17:51:04] If we're to rebuild it, we'll want to trusty it up [17:52:08] I foresee a week of pain as all the unpuppetized things bite everyone in all the places [17:52:25] There is nothing unpuppetized on the masters. [17:53:46] Indeed, contrary to what might be expected, the ge masters are the simplest of all the instances. [17:54:16] famous last words :) [17:59:14] * Coren goes to fetch food before he starves. [19:03:50] Coren / andrewbogott / YuviPanda : Did you just do some sort of database configuration change?
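Per the wikitech note quoted above, a cold-migrate leaves the old disk image behind under /var/lib/nova/instances on the source host. A naive sketch of spotting such leftovers, assuming the usual nova layout of one subdirectory per instance named by UUID (shared directories such as `_base` would need excluding by the caller):

```python
import os

def leftover_dirs(instances_dir, active_uuids):
    """Subdirectories of a hypervisor's instance store that don't
    correspond to an instance still scheduled there -- candidates for
    manual cleanup after a cold-migrate.  Naive: it only compares
    directory names against the given UUID set, so shared directories
    like '_base' must not be in instances_dir or must be filtered."""
    return sorted(
        name for name in os.listdir(instances_dir)
        if os.path.isdir(os.path.join(instances_dir, name))
        and name not in active_uuids
    )
```

The `active_uuids` set would come from the same `/servers/detail` listing, filtered to servers whose `OS-EXT-SRV-ATTR:host` is the host being inspected.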
A query that worked this afternoon now dies after exactly 5 minutes with: [19:03:51] ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 104 [19:04:27] multichill: Nothing I did, certainly, and I am aware of no ongoing work to the replicas. [19:04:53] !log tools.lolrrit-wm ori restarted the bot after it stopped working due to the gerrit restart [19:04:56] Logged the message, Master [19:05:04] Weird, kill times: real 5m0.872s, real 5m1.061s, real 5m0.886s [19:05:21] Hm, repeatable? [19:05:51] mysql --defaults-file=~/replica.my.cnf -h dewiki.labsdb [19:06:04] blocks without connecting [19:06:23] O crap [19:06:27] Mysql is just broken [19:07:02] Coren: Erhm, so looks like we're not able to connect to mysql at the moment.... [19:07:14] Looks like one of the servers is ill. [19:07:23] And after 5 minutes the client probably gives up [19:11:16] Yeah, one of the three replicas is ill. Looking into it now. [19:17:44] db server restarted; it looks healthy now. [19:23:08] Coren or andrewbogott: can I get the security groups quota raised on the deployment-prep project? we are at 10/10 right now and I need to make a new group for logstash hosts [19:23:18] bd808: yep, one minute... [19:23:25] Ninja'd [19:24:28] bd808: done [19:24:39] thanks folks [19:25:12] Thanks Coren, running queries now, let's see if it works :-) [19:37:57] Coren: https://gerrit.wikimedia.org/r/#/c/216154/ seems to work if I give ‘glance’ a shell in /etc/passwd on both hosts. But that seems risky… [19:38:12] Should I just create a new ‘glance-sync’ user for that one cron? [19:42:23] * Coren ponders [19:42:55] Of course the new user would have to be able to chown glance things [19:43:39] Might not be a major issue if the authorized_keys is suitably restricted. [19:44:23] Not like we have password auth - the only possible way to start a session with glance would be via ssh and if the command that can be run is limited, then it's not a huge issue imo.
[19:44:23] which? [19:44:35] That said, I'd consult moritz [19:45:07] ok. If I were to puppetize /bin/bash for that user… how would I do it? [19:45:37] Oh, wait, that user comes from a deb package, right? [19:46:20] right [19:46:22] Gah. Those are a pain to mess with puppet, at best. A different user might be required if only because of that. [19:46:30] yeah, ok. [19:47:01] Can you suggest a clever solution to handle ownership? Otherwise I’ll just have a root cron run and chown everything. [19:48:21] Hm. That depends on whether ownership is, in fact, required. If group membership suffices with suitable permissions, then sgid directories would handle it. [20:02:02] Coren: https://gerrit.wikimedia.org/r/#/c/216296/ [20:02:08] Also, can you tell me how to restrict the use of that key? [20:02:45] andrewbogott: Hm, hang on, I don't remember the puppet stanza for it. [20:05:20] andrewbogott: You need to put a command="the_command" in the key options; the ssh_authorized_key puppet resource supports an options => stanza, but I don't know if our ssh::userkey does. [20:05:45] * Coren checks [20:06:04] looks like not [20:06:29] Ah, it uses a source => for the full file, you can just put it there instead. [20:06:32] (in the file) [20:07:33] You put options in front of the key type, like so: [20:07:55] command="whatever" ssh-dss AAAA....== name [20:08:23] so can it just be command=“chown” ? [20:08:30] I guess that’s a big hole still [20:08:33] I’ll put the full command in [20:09:09] andrewbogott: You need to put the full command; the supplied command will be outright ignored. [20:09:21] ah, ok. [20:09:26] um... [20:09:33] except the full command contains variable resolution [20:09:56] and it’s rsync, so I don’t even know what the ‘command’ really is… [20:10:25] Ah. It's a bit more complex then; the command you execute /does/ have access to the original command in an environment variable, but now you'd have to write a bit of code to deal with it.
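Coren's "bit of code" above is a forced-command wrapper: with `command="..."` in authorized_keys, sshd ignores the client's requested command and runs the configured one, exposing the original request in the `SSH_ORIGINAL_COMMAND` environment variable. A sketch of a wrapper that only lets rsync's server mode through (the script path, key name, and allowed prefix are illustrative, not the actual glance-sync configuration):

```python
#!/usr/bin/env python3
# Sketch of a forced-command wrapper.  In authorized_keys it would be
# wired up as (path and key are illustrative):
#   command="/usr/local/sbin/rsync-only" ssh-rsa AAAA...== glance-sync
import os
import sys

def allowed(original, prefix=("rsync", "--server")):
    """Return the argv to exec if the client's command is rsync in
    server mode, else None.  Naive whitespace split: enough for the
    fixed argv rsync generates, not for arbitrary shell quoting."""
    argv = original.split()
    return argv if tuple(argv[:len(prefix)]) == prefix else None

if __name__ == "__main__":
    argv = allowed(os.environ.get("SSH_ORIGINAL_COMMAND", ""))
    if argv is None:
        sys.exit("rejected")
    os.execvp(argv[0], argv)  # replace ourselves with the vetted rsync
```

Since rsync over ssh always invokes `rsync --server ...` on the remote side, prefix-matching on that is the usual way to pin a key to rsync-only use.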
[20:11:44] Still doable, but now we're getting into "possibly more complex than justified for this use case" [20:12:25] yep, I think I will do nothing for now. [20:12:41] Would still appreciate a glance at https://gerrit.wikimedia.org/r/#/c/216296/ [20:25:42] hello [20:25:50] does anybody have an e-mail to ZhouZ? [20:26:19] is this person on phab? [20:26:24] yup [20:26:31] https://phabricator.wikimedia.org/p/ZhouZ/ [20:26:54] https://wikimediafoundation.org/wiki/User:ZZhou_(WMF) [20:26:57] i basically want to draw his attention to my reply there: https://phabricator.wikimedia.org/T97844 [20:26:59] has an email [20:27:13] awesome [20:27:15] thanks [20:27:16] there you go then, or I was going to say you could use conpherence and they will probably get it as an email [20:27:19] d33tah: Normally, just pinging on the ticket also works. [20:27:33] ^ that too [20:28:14] Coren: out of many times i tried that, it only worked once [20:35:04] wikibugs is missing [20:43:12] twentyafterfour, legoktm, YuviPanda: ^ [20:43:33] hmm [20:53:12] Coren: Is the replication still working for the database server you gave a nudge? [20:59:13] !log tools.wikibugs legoktm: Deployed 82b0b9f487ece85a40595b80f3f690554743e472 Ignore Forrestbot wb2-phab, wb2-irc [20:59:15] Logged the message, Master [20:59:59] multichill: It should; a simple restart shouldn't have affected that - but since it was ill that might have broken replication before the restart. Open a ticket so one of our DBAs can look at it? [21:00:54] Krenair, sitic: ^ restarted [21:01:03] :-) [21:01:08] 10Wikibugs: wikibugs test bug - https://phabricator.wikimedia.org/T1152#1342411 (10Legoktm) ! [21:05:23] Coren, andrewbogott: if I asked for more ram quota on the deployment-prep project what would you say? I want to build an xlarge instance and there is only ram left for a large [21:05:51] bd808: I'd cry a little inside.
:-) [21:05:54] the node will be a new logstash server and once it is stable I will kill the old xlarge that we have now [21:06:16] How much do you need, another 8G then? [21:06:56] it looks like I need another 6144 [21:08:07] U can haz ramz. [21:08:36] {{done}} [21:09:03] yay, u haz a ram [21:09:18] thx Coren [21:10:14] 10Tool-Labs: Labs database server with ipaddress 10.64.37.4 seems to have stopped replicating - https://phabricator.wikimedia.org/T101561#1342437 (10Multichill) 3NEW [21:11:19] What is the monitoring these days YuviPanda? [21:11:30] Docs link to http://ganglia.wmflabs.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=tools&h=tools-master&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4#mg_SGE_div and http://icinga.wmflabs.org/cgi-bin/icinga/status.cgi?hostgroup=tools&style=detail [21:11:39] Both seem to be incorrect [21:12:26] multichill: shinken.wmflabs.org maybe? but shinken I think [21:13:06] chasemp: http://shinken.wmflabs.org/user/login?error=Invalid%20user%20or%20Password asks me for a username and password over http [21:13:19] labs creds? [21:13:22] honestly not sure [21:13:28] I'm not using http [21:13:29] but yeah [21:13:38] right [21:15:34] I don't see anything via wikitech search so seems like a poke YuviPanda situation [21:23:09] chasemp: I think it's guest / guest and user accounts have to be created in Shinken itself iirc [21:31:05] Coren, andrewbogott: another whine from me :( My new instance deployment-logstash2.eqiad.wmflabs (i-00000ca9.eqiad.wmflabs) is a jessie host in the deployment-prep project. It is spewing out "puppet-agent[1230]: Could not request certificate: Error 500 on SERVER:" on the web console and hasn't gotten far enough to allow me to ssh in yet. 
[21:31:20] I restarted the deployment-prep puppetmaster but that didn't help [21:31:32] and there is no pending cert to be signed there [21:41:54] Just was going to ask - if I run puppet agent --test on labs and get 500 from abs-puppetmaster-eqiad.wikimedia.org:8140 - is it normal? [21:41:59] or something is broken? [21:42:14] https://labs-puppetmaster-eqiad.wikimedia.org:8140 [21:44:40] bd808: looking. [21:44:54] thx [21:48:04] um... [21:48:11] SMalyshev: what instance are you seeing that on? [21:48:27] andrewbogott: db01 on labs [21:48:41] what project? [21:49:05] andrewbogott: wikidata-query [21:58:24] SMalyshev: looks to me like that instance is oom or crashing or something [21:59:04] andrewbogott: which one? you mean puppetmaster or client? [21:59:11] db01 [21:59:28] andrewbogott: don't see anything wrong there. Why do you think so? [22:00:02] bd808: I can’t create a new jessie instance either. I don’t know what changed.
[22:01:41] not yet, at least [22:04:13] PROBLEM - Puppet failure on tools-checker-02 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:04:17] PROBLEM - Puppet failure on tools-webgrid-generic-1404 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:04:26] PROBLEM - Puppet failure on tools-exec-1201 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:04:32] PROBLEM - Puppet failure on tools-exec-wmt is CRITICAL 100.00% of data above the critical threshold [0.0] [22:04:36] PROBLEM - Puppet failure on tools-precise-dev is CRITICAL 100.00% of data above the critical threshold [0.0] [22:04:36] PROBLEM - Puppet failure on tools-exec-1404 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:04:40] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1405 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:04:44] PROBLEM - Puppet failure on tools-bastion-02 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:04:52] PROBLEM - Puppet failure on tools-webgrid-generic-1401 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:04:57] SMalyshev: try now? 
[22:05:01] PROBLEM - Puppet failure on tools-exec-1407 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:05:07] PROBLEM - Puppet failure on tools-exec-1403 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:05:07] PROBLEM - Puppet failure on tools-exec-1212 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:05:09] PROBLEM - Puppet failure on tools-submit is CRITICAL 100.00% of data above the critical threshold [0.0] [22:05:15] andrewbogott: now works fine, thanks [22:05:21] PROBLEM - Puppet failure on tools-shadow is CRITICAL 100.00% of data above the critical threshold [0.0] [22:05:21] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1401 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:05:25] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1201 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:05:27] bd808: yours may come around in a moment too [22:05:33] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1209 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:05:33] PROBLEM - Puppet failure on tools-exec-1410 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:05:37] PROBLEM - Puppet failure on tools-exec-1402 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:05:37] PROBLEM - Puppet failure on tools-exec-1409 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:05:41] PROBLEM - Puppet failure on tools-exec-1202 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:05:47] PROBLEM - Puppet failure on tools-exec-1208 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:05:53] PROBLEM - Puppet failure on tools-exec-1406 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:05:59] PROBLEM - Puppet failure on tools-exec-1209 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:06:01] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1204 is CRITICAL 100.00% of data 
above the critical threshold [0.0] [22:06:07] PROBLEM - Puppet failure on tools-exec-1210 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:06:17] PROBLEM - Puppet failure on tools-exec-1205 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:06:21] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1210 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:06:29] ohai shinken-wm. did you just see the puppetmaster restart? [22:06:33] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:06:34] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1207 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:06:34] PROBLEM - Puppet failure on tools-webproxy-02 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:06:42] PROBLEM - Puppet failure on tools-services-02 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:06:50] PROBLEM - Puppet failure on tools-webgrid-generic-1403 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:06:53] Yeah, the puppetmaster on virt1000 lost its marbles. Seems to be behaving now. 
[22:06:58] PROBLEM - Puppet failure on tools-exec-1219 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:07:00] PROBLEM - Puppet failure on tools-exec-catscan is CRITICAL 100.00% of data above the critical threshold [0.0] [22:07:05] Not that you would notice from watching shinken [22:07:06] andrewbogott: cool, thanks [22:07:10] PROBLEM - Puppet failure on tools-redis-02 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:07:14] PROBLEM - Puppet failure on tools-exec-1401 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:07:25] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:07:25] PROBLEM - Puppet failure on tools-webproxy-01 is CRITICAL 88.89% of data above the critical threshold [0.0] [22:07:25] PROBLEM - Puppet failure on tools-exec-1206 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:07:37] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1206 is CRITICAL 70.00% of data above the critical threshold [0.0] [22:07:37] PROBLEM - Puppet failure on tools-exec-gift is CRITICAL 100.00% of data above the critical threshold [0.0] [22:07:39] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:07:43] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1406 is CRITICAL 90.00% of data above the critical threshold [0.0] [22:07:45] PROBLEM - Puppet failure on tools-webgrid-generic-1402 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:07:49] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1203 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:07:51] PROBLEM - Puppet failure on tools-mail is CRITICAL 100.00% of data above the critical threshold [0.0] [22:07:56] andrewbogott: I'm all better. thanks! [22:08:03] Man, were /all/ these instances hitting the puppetmaster at once? No wonder it collapsed. 
[22:08:03] PROBLEM - Puppet failure on tools-services-01 is CRITICAL 80.00% of data above the critical threshold [0.0]
[22:08:04] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1410 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:08:11] PROBLEM - Puppet failure on tools-exec-1204 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:08:12] PROBLEM - Puppet failure on tools-exec-1408 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:08:12] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1208 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:08:16] PROBLEM - Puppet failure on tools-exec-1218 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:08:32] PROBLEM - Puppet failure on tools-master is CRITICAL 70.00% of data above the critical threshold [0.0]
[22:08:34] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1403 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:08:42] PROBLEM - Puppet failure on tools-exec-1213 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:08:44] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1205 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:08:48] PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:08:48] PROBLEM - Puppet failure on tools-trusty is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:08:48] PROBLEM - Puppet failure on tools-exec-1203 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:08:54] PROBLEM - Puppet failure on tools-exec-1405 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:08:56] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1408 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:09:00] PROBLEM - Puppet failure on tools-checker-01 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:09:04] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1404 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:09:04] PROBLEM - Puppet failure on tools-exec-1214 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:09:05] PROBLEM - Puppet failure on tools-redis-01 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:09:05] PROBLEM - Puppet failure on tools-exec-1217 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:09:18] PROBLEM - Puppet failure on tools-static-02 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:09:19] PROBLEM - Puppet failure on tools-exec-1211 is CRITICAL 90.00% of data above the critical threshold [0.0]
[22:09:19] PROBLEM - Puppet failure on tools-exec-cyberbot is CRITICAL 88.89% of data above the critical threshold [0.0]
[22:09:19] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1202 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:09:24] PROBLEM - Puppet failure on tools-exec-1216 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:09:26] PROBLEM - Puppet failure on tools-static-01 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:09:34] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL 90.00% of data above the critical threshold [0.0]
[22:09:35] PROBLEM - Puppet failure on tools-exec-1215 is CRITICAL 100.00% of data above the critical threshold [0.0]
[22:15:25] Oy! Did the puppet master asplode?
[22:15:29] andrewbogott: Need help?
[22:15:39] All fixed, just waiting for shinken to notice.
[22:16:03] It was 500-ing for some reason. I restarted apache (rather than gracefulling it which I probably should’ve) and…
[22:16:29] I dunno, I guess all those hosts were actively retrying because of the 500, or something? Anyway I think that recoveries will roll in shortly.
[22:16:32] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1207 is OK Less than 1.00% above the threshold [0.0]
[22:16:35] hah!
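The "all those hosts were actively retrying because of the 500" guess describes a classic thundering herd: once the master errors, a whole fleet of agents hits it again on roughly the same schedule. The usual mitigation is exponential backoff with jitter, sketched below (illustrative only; puppet agents actually use a fixed, splayed run interval, not this exact logic, and the function name is invented):

```python
import random

def backoff_delays(base=2.0, cap=300.0, attempts=6, seed=None):
    """Generate jittered exponential backoff delays in seconds.

    Each retry waits a random amount up to base * 2**attempt (capped),
    so ~60 hosts do not hammer a recovering puppetmaster in lockstep.
    """
    rng = random.Random(seed)
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays

# Two hosts seeded differently retry at different moments.
print(backoff_delays(seed=1))
print(backoff_delays(seed=2))
```

The "full jitter" variant shown (uniform over [0, ceiling]) spreads the load better than adding a small jitter to a fixed delay, at the cost of occasionally retrying very quickly.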
[22:17:26] RECOVERY - Puppet failure on tools-webproxy-01 is OK Less than 1.00% above the threshold [0.0]
[22:17:33] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1206 is OK Less than 1.00% above the threshold [0.0]
[22:17:42] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1406 is OK Less than 1.00% above the threshold [0.0]
[22:18:00] RECOVERY - Puppet failure on tools-services-01 is OK Less than 1.00% above the threshold [0.0]
[22:18:34] RECOVERY - Puppet failure on tools-master is OK Less than 1.00% above the threshold [0.0]
[22:18:36] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1403 is OK Less than 1.00% above the threshold [0.0]
[22:18:55] ok, I’m out for the day, unless PROBLEMs start landing in my inbox again.
[22:18:56] g’night all
[22:18:56] RECOVERY - Puppet failure on tools-checker-01 is OK Less than 1.00% above the threshold [0.0]
[22:19:00] RECOVERY - Puppet failure on tools-redis-01 is OK Less than 1.00% above the threshold [0.0]
[22:19:19] RECOVERY - Puppet failure on tools-exec-1211 is OK Less than 1.00% above the threshold [0.0]
[22:19:23] RECOVERY - Puppet failure on tools-exec-cyberbot is OK Less than 1.00% above the threshold [0.0]
[22:19:23] RECOVERY - Puppet failure on tools-exec-1201 is OK Less than 1.00% above the threshold [0.0]
[22:19:33] RECOVERY - Puppet failure on tools-bastion-01 is OK Less than 1.00% above the threshold [0.0]
[22:20:01] RECOVERY - Puppet failure on tools-exec-1407 is OK Less than 1.00% above the threshold [0.0]
[22:20:25] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1401 is OK Less than 1.00% above the threshold [0.0]
[22:20:25] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1201 is OK Less than 1.00% above the threshold [0.0]
[22:20:35] RECOVERY - Puppet failure on tools-exec-1409 is OK Less than 1.00% above the threshold [0.0]
[22:20:55] RECOVERY - Puppet failure on tools-exec-1406 is OK Less than 1.00% above the threshold [0.0]
[22:21:01] RECOVERY - Puppet failure on tools-exec-1209 is OK Less than 1.00% above the threshold [0.0]
[22:21:31] RECOVERY - Puppet failure on tools-webproxy-02 is OK Less than 1.00% above the threshold [0.0]
[22:21:53] RECOVERY - Puppet failure on tools-webgrid-generic-1403 is OK Less than 1.00% above the threshold [0.0]
[22:22:57] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1203 is OK Less than 1.00% above the threshold [0.0]
[22:23:13] RECOVERY - Puppet failure on tools-exec-1408 is OK Less than 1.00% above the threshold [0.0]
[22:23:17] RECOVERY - Puppet failure on tools-exec-1218 is OK Less than 1.00% above the threshold [0.0]
[22:23:39] RECOVERY - Puppet failure on tools-exec-1213 is OK Less than 1.00% above the threshold [0.0]
[22:23:43] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1205 is OK Less than 1.00% above the threshold [0.0]
[22:23:50] RECOVERY - Puppet failure on tools-exec-1203 is OK Less than 1.00% above the threshold [0.0]
[22:23:58] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1408 is OK Less than 1.00% above the threshold [0.0]
[22:24:00] RECOVERY - Puppet failure on tools-exec-1214 is OK Less than 1.00% above the threshold [0.0]
[22:24:04] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1404 is OK Less than 1.00% above the threshold [0.0]
[22:24:04] RECOVERY - Puppet failure on tools-exec-1217 is OK Less than 1.00% above the threshold [0.0]
[22:24:20] RECOVERY - Puppet failure on tools-static-02 is OK Less than 1.00% above the threshold [0.0]
[22:24:26] RECOVERY - Puppet failure on tools-static-01 is OK Less than 1.00% above the threshold [0.0]
[22:24:32] RECOVERY - Puppet failure on tools-exec-1215 is OK Less than 1.00% above the threshold [0.0]
[22:24:36] RECOVERY - Puppet failure on tools-exec-1404 is OK Less than 1.00% above the threshold [0.0]
[22:25:10] RECOVERY - Puppet failure on tools-submit is OK Less than 1.00% above the threshold [0.0]
[22:25:34] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1209 is OK Less than 1.00% above the threshold [0.0]
[22:25:34] RECOVERY - Puppet failure on tools-exec-1410 is OK Less than 1.00% above the threshold [0.0]
[22:26:08] RECOVERY - Puppet failure on tools-exec-1210 is OK Less than 1.00% above the threshold [0.0]
[22:26:20] RECOVERY - Puppet failure on tools-exec-1205 is OK Less than 1.00% above the threshold [0.0]
[22:26:34] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK Less than 1.00% above the threshold [0.0]
[22:26:42] RECOVERY - Puppet failure on tools-services-02 is OK Less than 1.00% above the threshold [0.0]
[22:26:58] RECOVERY - Puppet failure on tools-exec-1219 is OK Less than 1.00% above the threshold [0.0]
[22:27:10] RECOVERY - Puppet failure on tools-redis-02 is OK Less than 1.00% above the threshold [0.0]
[22:27:24] RECOVERY - Puppet failure on tools-exec-1206 is OK Less than 1.00% above the threshold [0.0]
[22:27:36] RECOVERY - Puppet failure on tools-exec-gift is OK Less than 1.00% above the threshold [0.0]
[22:27:39] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK Less than 1.00% above the threshold [0.0]
[22:27:47] RECOVERY - Puppet failure on tools-webgrid-generic-1402 is OK Less than 1.00% above the threshold [0.0]
[22:28:09] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1208 is OK Less than 1.00% above the threshold [0.0]
[22:28:09] RECOVERY - Puppet failure on tools-exec-1204 is OK Less than 1.00% above the threshold [0.0]
[22:28:49] RECOVERY - Puppet failure on tools-trusty is OK Less than 1.00% above the threshold [0.0]
[22:29:19] RECOVERY - Puppet failure on tools-checker-02 is OK Less than 1.00% above the threshold [0.0]
[22:29:19] RECOVERY - Puppet failure on tools-webgrid-generic-1404 is OK Less than 1.00% above the threshold [0.0]
[22:29:19] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1202 is OK Less than 1.00% above the threshold [0.0]
[22:29:31] RECOVERY - Puppet failure on tools-exec-wmt is OK Less than 1.00% above the threshold [0.0]
[22:30:09] RECOVERY - Puppet failure on tools-exec-1403 is OK Less than 1.00% above the threshold [0.0]
[22:30:37] RECOVERY - Puppet failure on tools-exec-1402 is OK Less than 1.00% above the threshold [0.0]
[22:32:13] RECOVERY - Puppet failure on tools-exec-1401 is OK Less than 1.00% above the threshold [0.0]
[22:32:40] 6Labs, 7database: Rebuild s6 and s7 on labsdb1002 - https://phabricator.wikimedia.org/T101567#1342618 (10jcrespo) 3NEW
[22:32:51] RECOVERY - Puppet failure on tools-mail is OK Less than 1.00% above the threshold [0.0]
[22:33:46] RECOVERY - Puppet failure on tools-exec-1207 is OK Less than 1.00% above the threshold [0.0]
[22:33:52] RECOVERY - Puppet failure on tools-exec-1405 is OK Less than 1.00% above the threshold [0.0]
[22:34:22] RECOVERY - Puppet failure on tools-exec-1216 is OK Less than 1.00% above the threshold [0.0]
[22:34:36] RECOVERY - Puppet failure on tools-precise-dev is OK Less than 1.00% above the threshold [0.0]
[22:34:38] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1405 is OK Less than 1.00% above the threshold [0.0]
[22:34:44] RECOVERY - Puppet failure on tools-bastion-02 is OK Less than 1.00% above the threshold [0.0]
[22:34:52] RECOVERY - Puppet failure on tools-webgrid-generic-1401 is OK Less than 1.00% above the threshold [0.0]
[22:35:08] RECOVERY - Puppet failure on tools-exec-1212 is OK Less than 1.00% above the threshold [0.0]
[22:35:22] RECOVERY - Puppet failure on tools-shadow is OK Less than 1.00% above the threshold [0.0]
[22:35:42] RECOVERY - Puppet failure on tools-exec-1202 is OK Less than 1.00% above the threshold [0.0]
[22:36:21] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1210 is OK Less than 1.00% above the threshold [0.0]
[22:36:59] RECOVERY - Puppet failure on tools-exec-catscan is OK Less than 1.00% above the threshold [0.0]
[22:37:22] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK Less than 1.00% above the threshold [0.0]
[22:38:08] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1410 is OK Less than 1.00% above the threshold [0.0]
[22:40:45] RECOVERY - Puppet failure on tools-exec-1208 is OK Less than 1.00% above the threshold [0.0]
[22:40:59] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1204 is OK Less than 1.00% above the threshold [0.0]
[23:03:58] question: if I have a package (zip) with a service to install, what is the best way to make it suitable for puppet? basically what needs to be done is downloading this zip and unpacking it in a specific place
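One common answer to the question above, sketched as a Puppet manifest. All names, paths, and the URL are placeholders, and the `archive` resource assumes the widely used puppet/archive module plus a zip that ships its own init script; this is a sketch under those assumptions, not a drop-in solution:

```puppet
# Hypothetical class: download a zip, unpack it to a fixed place, run its service.
class mytool {
  # archive comes from the puppet/archive module; 'creates' makes the
  # download/extract step idempotent across puppet runs.
  archive { '/tmp/mytool.zip':
    source       => 'https://example.org/downloads/mytool.zip',
    extract      => true,
    extract_path => '/opt/mytool',
    creates      => '/opt/mytool/bin/mytool',
    cleanup      => true,
  }

  service { 'mytool':
    ensure  => running,
    enable  => true,
    require => Archive['/tmp/mytool.zip'],
  }
}
```

Without that module, 2015-era manifests often did the same thing with an `exec` running wget/unzip guarded by a `creates` parameter, which works but leaves idempotence and error handling to the shell command.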