[00:42:42] 10Quarry: Time limit on quarry queries - https://phabricator.wikimedia.org/T111779#1794923 (10yuvipanda) So I can confirm that it does get killed in 30min but just doesn't get reflected in the status. I'm investigating why
[00:44:36] 10Quarry: Time limit on quarry queries - https://phabricator.wikimedia.org/T111779#1794925 (10yuvipanda) ``` Nov 10 00:21:23 quarry-runner-01 celery[29630]: Traceback (most recent call last): Nov 10 00:21:23 quarry-runner-01 celery[29630]: File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in...
[02:18:49] YuviPanda, around?
[03:26:45] 6Labs, 10Labs-Infrastructure, 10MediaWiki-Vagrant: InstantCommons stopped working on Labs-Vagrant, now lots of missing images. - https://phabricator.wikimedia.org/T118226#1795065 (10Spage) 3NEW
[03:57:45] Krenair: kindof
[10:14:32] YuviPanda: you know pywikibot right?
[10:14:46] or at least, more than me? ;)
[10:14:55] oh wait, you're in a silly country and timezone now, BAH!
[10:34:54] lol what happened?
[11:09:17] YuviPanda, is there a page documenting the special labs views?
[12:26:38] 6Labs, 10Labs-Infrastructure, 10netops, 6operations, and 3 others: Allocate labs subnet in dallas - https://phabricator.wikimedia.org/T115491#1795538 (10faidon) @chasemp, is this done?
[12:30:38] 6Labs, 6operations, 3Labs-Sprint-102, 3Labs-Sprint-103, and 5 others: Reinstall labstore1001 and make sure everything is puppet-ready - https://phabricator.wikimedia.org/T107574#1795543 (10faidon)
[12:30:41] 6Labs, 6operations, 5Patch-For-Review: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1795540 (10faidon) 5Open>3Resolved a:3faidon The patch above got three +1s and was therefore merged.
[13:03:19] 6Labs, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata: [Bug] Wikidata JSON dumps gets deleted after every new Wikidata dump - https://phabricator.wikimedia.org/T107226#1795574 (10Hydriz) 5Open>3Invalid a:3Hydriz The most recent Wikidata dump (October 2, 2015)...
[14:23:28] YuviPanda: It looks like puppet failed everywhere, around midnight last night. Do you know what that’s about? It’s the second time this week it’s happened. (everything recovered a while later.)
[14:35:56] andrewbogott: apt update failing, I think
[14:36:10] there's no puppet failure, it's just staleness, which suggests puppet didn't run at all
[14:36:26] and I have alerts from toolsbeta complaining about /etc/cron.daily/apt: An error occurred: '504 Gateway Time-out [IP: 208.80.154.10 8080]' The URI 'http://security.ubuntu.com/ubuntu/pool/main/l/linux/linux-image-3.13.0-68-generic_3.13.0-68.111_amd64.deb' failed to download, aborting
[14:36:41] valhallasw`cloud: ah, ok. Not much we can do about that.
[14:36:44] thanks!
[14:37:32] (security updates over http... I suppose the packages are gpg signed, but still...)
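(On the Quarry behaviour described above, where the query is killed at 30 minutes but its status never changes: a hard Celery time_limit kills the worker process outright, so nothing inside the task gets a chance to record the termination. Pairing it with a soft_time_limit raises an exception the task can catch and use to update the status first. The sketch below only illustrates that pattern; execute_sql() and update_status() are hypothetical placeholders, not Quarry's actual code.)
```
# Sketch of the soft/hard time limit pattern; this is not Quarry's code.
# execute_sql() and update_status() are hypothetical placeholders.
from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded

app = Celery('quarry', broker='redis://localhost:6379/0')  # broker URL is a placeholder

@app.task(soft_time_limit=30 * 60, time_limit=30 * 60 + 60)
def run_query(query_run_id, sql):
    try:
        rows = execute_sql(sql)                   # hypothetical helper
        update_status(query_run_id, 'complete')   # hypothetical helper
        return rows
    except SoftTimeLimitExceeded:
        # The soft limit fires before the hard SIGKILL, leaving time to
        # mark the run as killed instead of leaving it stuck "running".
        update_status(query_run_id, 'killed')
        raise
```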
[15:16:35] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[15:16:57] PROBLEM - Puppet failure on tools-exec-1217 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[15:17:19] PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]
[15:17:40] PROBLEM - Puppet failure on tools-exec-1208 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[15:18:14] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[15:19:52] PROBLEM - Puppet failure on tools-shadow is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[15:20:04] PROBLEM - Puppet failure on tools-exec-1409 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[15:20:06] PROBLEM - Puppet failure on tools-mail is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[15:22:04] PROBLEM - Puppet failure on tools-exec-gift is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[15:22:05] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[15:22:16] PROBLEM - Puppet failure on tools-exec-1212 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]
[15:22:38] PROBLEM - Puppet failure on tools-exec-1216 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[15:22:53] is this just puppet in the tools project freaking out?
[15:24:51] side-effect of 16:13 PROBLEM - Host labs-ns0.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%, I assume
[15:25:18] Error: /File[/var/lib/puppet/lib]: Could not evaluate: Connection refused - connect(2) Could not retrieve file metadata for puppet://labs-puppetmaster-eqiad.wikimedia.org/plugins: Connection refused - connect(2)
[15:26:02] which is labcontrol1001.wikimedia.org
[15:26:15] ok thanks valhallasw`cloud -- fwiw we think that was an onsite issue that was in theory accidental and momentary
[15:26:54] ok.
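(For context on the alert wording above: the "NN% of data above the critical threshold [0.0]" messages come from checks that pull recent datapoints for an instance's puppet-failure metric out of graphite and compute what fraction of them are non-zero. The sketch below is only a rough illustration of that calculation; the metric path, time window and thresholds are assumptions, not the real check configuration.)
```
# Rough illustration of a "percent of data above threshold" check.
# Metric path, window and thresholds are assumptions, not the real config.
import json
import urllib2  # Python 2, matching the environment in this log

GRAPHITE = 'http://graphite.wmflabs.org/render'
metric = 'tools.tools-exec-1217.puppetagent.failed_events.value'  # hypothetical path

url = '%s?target=%s&from=-10min&format=json' % (GRAPHITE, metric)
datapoints = json.load(urllib2.urlopen(url))[0]['datapoints']
values = [v for v, _ts in datapoints if v is not None]

above = sum(1 for v in values if v > 0.0)
pct = 100.0 * above / len(values) if values else 0.0
state = 'CRITICAL' if pct >= 40 else 'OK'
print('%s: %.2f%% of data above the critical threshold [0.0]' % (state, pct))
```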
The fact we only got 10 or so alerts suggests it was indeed transient
[15:52:09] 6Labs, 10Labs-Infrastructure, 6operations, 5Patch-For-Review: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1796007 (10chasemp)
[15:52:24] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0]
[15:52:44] RECOVERY - Puppet failure on tools-exec-1208 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:53:06] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:54:44] RECOVERY - Puppet failure on tools-shadow is OK: OK: Less than 1.00% above the threshold [0.0]
[15:55:00] RECOVERY - Puppet failure on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:55:15] RECOVERY - Puppet failure on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0]
[15:56:41] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:57:03] RECOVERY - Puppet failure on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0]
[15:57:04] RECOVERY - Puppet failure on tools-exec-1217 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:57:08] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1209 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:57:16] RECOVERY - Puppet failure on tools-exec-1212 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:57:40] RECOVERY - Puppet failure on tools-exec-1216 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:02:59] Does Content Translation Tool overwrite existing articles?
[16:03:43] le 10/11/2015. ma , sens touts que nous sommes bravos pour moi......
[16:06:41] GRAND - ILES REUNIONS. HAPPIA...HIPPI.COM
[16:09:37] pour faire dessination que peintre s' couleure . deviantart muro d'images payp
[16:10:03] doctaxon: #mediawiki-i18n might be a better place to ask
[16:10:41] !ops
[16:10:57] hmm, I gotta run to a conference and don't have op anyway
[16:10:59] good luck :)
[16:11:19] valhallasw`cloud: am at kubecon.io today and tomorrow (and yesterday) so no fastapt until thursday :( sorry
[16:11:25] ok!
[16:11:27] have fun!
[16:12:13] this won't have been working for a while: ChanServ- 9 *!~gerrit-wm@*.wikimedia.org +V [modified 3y 41w 4d ago]
[16:43:10] 6Labs, 6operations: labs precise instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#1796190 (10fgiunchedi) @andrew thoughts on what component/issue it might be causing this?
[17:40:01] hello, trying to figure out some performance issues with my tools. I notice the queries run about twice as fast on my local as on labs. My local just uses an SSH tunnel to connect to the same db
[17:40:28] valhallasw`cloud: ^ can you take a look? I feel like maybe we're running out of nodes
[17:40:29] or perhaps the Ruby code itself is running significantly slower on labs?
[17:40:33] either is possible
[17:40:38] depends on ruby version
[17:40:46] Ruby 2.1
[17:40:49] same on my local
[17:41:04] MusikAnimal: how are you measuring this?
[17:41:32] my tools use a before/after time comparison around where they make the query
[17:41:58] the frontend of my tools hits the API that I built, and the response contains that elapsed time
[17:44:37] eh, and "twice as fast" might be inaccurate, maybe runs ~30% slower on labs
[17:44:40] but this is consistent
[17:44:59] are you connecting to the same database server?
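(MusikAnimal's measurement, as described above, is a wall-clock before/after around the query, reported through the tool's API. The tool itself is Ruby, but the same idea sketched in Python makes it clear what that number actually captures: connection latency, server-side execution and client-side row handling all together. Host, credentials and the query below are placeholders.)
```
# Sketch of the before/after timing described above; credentials and the
# query are placeholders, and the real tool is Ruby, not Python.
import time
import pymysql

conn = pymysql.connect(host='enwiki.labsdb', user='u1234', password='...',
                       database='enwiki_p')
start = time.time()
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM revision_userindex WHERE rev_user_text = %s",
                ("Example user",))
    count = cur.fetchone()[0]
elapsed = time.time() - start  # includes network round-trips, not just server time
print("query took %.3fs (%d revisions)" % (elapsed, count))
```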
[17:45:31] yes I believe so
[17:45:34] the replication db
[17:45:42] my local just goes through an SSH tunnel
[17:46:07] there's more than one server
[17:46:25] not sure how to check
[17:46:54] how I'm connecting to the repl on my local/on labs is the same way they describe at https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database
[17:47:46] they are different credentials, one uses my tools-login, the other my actual tool
[17:48:08] the credentials shouldn't matter
[17:48:25] but again, you connect to some server. Which one? You have to define it somewhere in your code
[17:48:40] in the ssh tunnel case it's the server in the -L parameter of the ssh command
[17:49:13] enwiki.labsdb
[17:55:13] if it helps, my tools run on trusty, and it has to reload all the gems and whatnot. The gem versions are the same as my local, but maybe (wild guess) there's some read/execute lag on the filesystem
[17:55:40] as it has to go through a bunch of gems before actually hitting the replication db
[17:56:16] unless ruby is unloading gems, that should only matter on the first request
[17:58:15] yeah when I first boot up the tools it takes quite a while, which I assume is all the gem stuff, or something... but after that pages load quite fast. It's just the querying itself that I am noticing is slower
[17:58:40] how does it compare if you just run the query in the mysql client?
[17:58:44] also, which tool
[17:58:56] https://tools.wmflabs.org/musikanimal
[17:59:05] it will output the elapsed time
[17:59:19] let me try doing the queries via the mysql gem in pry console on labs/local
[18:06:19] hm. yeah, that seems to connect to the same database server
[18:07:13] the only other obvious option is caching, but I assume you tested both tools-first-then-local and the other way around
[18:07:37] it uses the redis for caching, which is disabled on my local
[18:08:05] and I tested on my local first, then labs
[18:08:16] when measuring the elapsed time
[18:08:23] ...so what you're saying is that you might actually be measuring the redis response time?
[18:08:44] well, could be, but the users I'm querying I don't think were in redis yet
[18:09:04] also I did try enabling redis on my local, pretty sure
[18:09:14] same redis server, via SSH tunnel
[18:09:36] of course I am forgetting the other obvious option
[18:09:37] server load
[18:09:45] top - 18:09:42 up 31 days, 22:56, 1 user, load average: 1.50, 1.26, 1.35
[18:10:02] but I don't get why
[18:10:28] yeah I'm getting similar times when querying via mysql gem in pry; but I also wonder if that's query caching
[18:10:49] and if you use the mysql command line client?
[18:11:50] will try that next; do you know how long the query cache lasts?
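(Two loose ends in the exchange above: enwiki.labsdb is a service name that can map to more than one backend, and repeat timings can be skewed by the server-side query cache. Both are easy to check from any client; the sketch below is illustrative only, with placeholder credentials and query.)
```
# Sketch: confirm which labsdb backend a connection lands on, and bypass
# the query cache when timing. Credentials and the query are placeholders.
import pymysql

conn = pymysql.connect(host='enwiki.labsdb', user='u1234', password='...',
                       database='enwiki_p')
with conn.cursor() as cur:
    cur.execute("SELECT @@hostname")   # which backend did this connection reach?
    print(cur.fetchone()[0])

    # SQL_NO_CACHE asks the server not to answer from the query cache,
    # which keeps repeated timing runs comparable.
    cur.execute("SELECT SQL_NO_CACHE COUNT(*) FROM revision_userindex "
                "WHERE rev_user_text = %s", ("Example user",))
    print(cur.fetchone()[0])
```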
[18:12:49] no, that depends on how busy the server is
[18:13:50] cause I think that's what I'm seeing now
[18:14:12] if I run on local then labs, labs is faster, and vice versa
[18:14:50] this is in the console; the webservice seems to always be a bit slower on labs, not sure why that would be
[18:15:34] that could be load related
[18:15:44] http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1447179226.819&target=tools.tools-webgrid-generic-1401.loadavg.15&target=tools.tools-webgrid-generic-1405.loadavg.15&target=tools.tools-webgrid-generic-1404.loadavg.15&target=tools.tools-webgrid-generic-1403.loadavg.15&target=tools.tools-webgrid-generic-1402.loadavg.15 < this doesn't look very
[18:15:44] nice
[18:16:09] although -1405 where your webservice is running looks ok
[18:16:46] still a load average of 2
[18:17:01] but that should be OK with 4 cpus
[18:17:15] I turned on redis on my local, ran a query on a user I hadn't run before, then ran it on labs, and still slower
[18:17:16] oh
[18:17:42] I guess the other thing is on labs my tools use a unicorn server, on my local it uses rackup, but unicorn should be way faster
[18:17:43] hm
[18:17:52] but it's possible it is messing up somehow
[18:18:12] are you writing anything to disk?
[18:18:20] nope
[18:18:30] well
[18:18:32] the logs
[18:19:01] unicorn actually has two, the .log and .out, rackup just spits it into stdout
[18:19:26] and it logs every query... so maybe that's related
[18:19:46] do you flush() the file descriptor?
[18:20:06] hmm not familiar with that
[18:21:11] this is my unicorn config: http://pastebin.com/N7eWKP1v
[18:21:29] how it writes to those files is internal, I don't tell it anything other than where to write it
[18:22:16] MusikAnimal: simple test would be to change the path to something in /tmp instead
[18:23:13] so tell it to write to /tmp and not ~/mytool/log/unicorn.log ?
[18:23:41] yeah, so /tmp/musikunicorn.log or something like that
[18:23:58] I didn't even know we were allowed to write there
[18:24:54] I've no problem with that. Will take a while to find out, restarting the server takes 5-10 minutes, which is another thing I wanted to ask. It seems to take longer and longer each time (could just be me though)
[18:25:05] I wonder if `qdel` is killing the unicorn servers
[18:25:31] it should
[18:25:35] is there a way for me to get into trusty and run `ps -ef | grep unicorn` ?
[18:25:40] there should be two processes
[18:25:48] or 3 I guess, one is the master
[18:26:24] you should be allowed to ssh in
[18:26:30] there's 3 processes on -1405
[18:28:08] ok so that's right then
[18:29:38] so you think writing to /tmp will inevitably be faster? or is this purely to test if there's a difference
[18:45:57] valhallasw`cloud: I restarted the tools and told unicorn to write to /tmp/musikunicorn.log but it is not logging
[18:46:21] however I will say the tools booted up much quicker this time than they have recently!
[18:53:30] actually looks like it's running OK now, not much slower, if at all, than my local
[18:53:33] did you guys do something?
[18:53:57] my only issue now is that I have no logging :(
[18:57:17] MusikAnimal: it's logging to /tmp on the exec host
[18:57:20] not on tools-login
[18:57:40] run: stat -xml
[18:57:45] eh, qstat -xml
[18:58:05] webgrid-generic@tools-webgrid-generic-1402.eqiad.wmflabs --> it's in /tmp on tools-webgrid-generic-1402
[18:58:37] cool, how can I get in there?
[18:59:49] also, does stuff in /tmp get truncated at some point?
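(The test proposed above, pointing the unicorn logs at /tmp instead of the NFS-backed home directory, can also be approximated directly: time a batch of flushed writes against both locations and compare. The sketch below is just that comparison; the paths and line count are arbitrary placeholders.)
```
# Sketch of the comparison behind the "log to /tmp instead" suggestion:
# time flushed writes to the NFS-backed home directory vs local /tmp.
# Paths and the line count are arbitrary placeholders.
import os
import time

def time_writes(path, lines=2000):
    start = time.time()
    with open(path, 'a') as f:
        for i in range(lines):
            f.write('request %d handled\n' % i)
            f.flush()
            os.fsync(f.fileno())  # push every line out, worst case for NFS
    return time.time() - start

for path in (os.path.expanduser('~/log-write-test.log'), '/tmp/log-write-test.log'):
    print('%s: %.3fs' % (path, time_writes(path)))
```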
the unicorn logs will grow quickly
[19:00:10] uuh, you should be able to ssh in, but it doesn't seem to work
[19:00:32] and no, I don't think we prune /tmp regularly
[19:02:21] Nov 10 19:01:16 tools-webgrid-generic-1402 sshd[23960]: Failed hostbased for valhallasw from 10.68.17.228 port 59085 ssh2: RSA SHA256:..., client user "valhallasw", client host "tools-bastion-01.tools.eqiad.wmflabs"
[19:02:21] Nov 10 19:01:16 tools-webgrid-generic-1402 sshd[23960]: userauth_hostbased mismatch: client sends tools-bastion-01.tools.eqiad.wmflabs, but we resolve 10.68.17.228 to 10.68.17.228
[19:02:23] huuuh.
[19:07:29] hmm I dunno, I can live without logging =P
[19:07:49] but just know that file will get big, really big
[19:08:01] thank you for the help!
[19:22:12] 6Labs: HBA failing to trusty instances on tool labs - https://phabricator.wikimedia.org/T116687#1796957 (10valhallasw) The full login log looks like this: ``` debug1: userauth-request for user valhallasw service ssh-connection method hostbased [preauth] debug1: attempt 1 failures 0 [preauth] debug2: input_userau...
[19:33:58] 6Labs: HBA failing to trusty instances on tool labs - https://phabricator.wikimedia.org/T116687#1796999 (10valhallasw) ...and then I remembered something: T109989 /that/ was the red herring. And the default value of UseDNS was changed in ubuntu: https://bugs.launchpad.net/ubuntu/+source/openssh/+bug/424371 (or,...
[19:46:39] PROBLEM - Host tools-andrew-puppettest is DOWN: CRITICAL - Host Unreachable (10.68.21.109)
[20:01:52] shinken-wm, that’s because I deleted the instance. Time to move on.
[20:02:08] YuviPanda: how do I make ^^ stop?
[20:34:22] 6Labs, 6operations: labs precise instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#1797373 (10Andrew) It is probably the result of a leaked host record and/or a leaked dns entry. And, indeed, I was mucking around with the dns backend a couple of days ago so you might have...
[20:39:02] I'm not able to log in to tool labs anymore when using putty. WinSCP works fine. Was something changed?
[20:50:42] apper: Works for my tools
[20:53:22] 6Labs, 10Labs-Infrastructure, 6operations, 5Patch-For-Review: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1797412 (10Andrew)
[21:08:27] 6Labs, 10Labs-Infrastructure, 6operations, 5Patch-For-Review: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1797463 (10Andrew)
[21:10:28] 6Labs, 10Labs-Infrastructure, 6operations, 5Patch-For-Review: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1766532 (10Andrew)
[21:19:49] andrewbogott: still at kubecon, won't be able to check :(
[21:20:16] YuviPanda: is it normal for shinken to complain about deleted hosts forever?
[21:20:28] nope
[21:20:30] definitely not
[21:20:32] so idk what's going on
[21:21:09] andrewbogott: you can probably find it in /etc/shinken on shinken-01 if you wanna take a look
[21:21:53] ok
[21:37:20] 6Labs, 10Labs-Infrastructure, 6operations, 5Patch-For-Review: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1797562 (10Andrew)
[22:04:40] Hi everybody!
[22:05:16] I have a webservice on tool labs that has ceased to work. It uses a custom lighttpd.conf with a FCGI process
[22:06:01] unfortunately it gives me The URI you have requested, /zoomviewer/iipsrv.fcgi, is not currently serviced.
[22:06:11] which is weird
[22:06:27] did I miss a memo on a tool labs webserver change?
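(On the "userauth_hostbased mismatch" log lines above: sshd compares the hostname the client asserts with the name it derives from the client's IP. With the newer Ubuntu default of UseDNS no, which valhallasw identifies in T116687, there is no reverse lookup, so the derived name is just the IP and the comparison can never match; that is what "we resolve 10.68.17.228 to 10.68.17.228" shows. The snippet below is only an illustration of that comparison, not sshd's actual code.)
```
# Illustration only (not sshd's code): the comparison behind the
# "userauth_hostbased mismatch" messages above.
import socket

client_ip = '10.68.17.228'
claimed = 'tools-bastion-01.tools.eqiad.wmflabs'   # name the client asserts

use_dns = False  # Ubuntu's newer default, UseDNS no
if use_dns:
    try:
        derived = socket.gethostbyaddr(client_ip)[0]
    except socket.herror:
        derived = client_ip
else:
    derived = client_ip  # no reverse lookup at all

if derived.lower() != claimed.lower():
    print('userauth_hostbased mismatch: client sends %s, but we resolve %s to %s'
          % (claimed, client_ip, derived))
```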
[22:07:13] Coren: ^ can you look at this?
[22:07:16] * YuviPanda is still at kubecon
[22:08:53] schwd: there wasn't really a change and this should be working
[22:09:56] schwd: what tool is this?
[22:11:05] ah zoomviewer
[22:12:26] schwd: I see http://tools.wmflabs.org/zoomviewer/
[22:12:29] it works?
[22:29:18] PROBLEM - Puppet failure on tools-redis-02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[22:44:53] PROBLEM - Puppet failure on tools-redis-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[23:42:56] 6Labs, 10Labs-Infrastructure, 6operations, 5Patch-For-Review: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1798052 (10Andrew)
[23:43:21] 6Labs, 10Labs-Infrastructure, 6operations, 5Patch-For-Review: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1766532 (10Andrew) All the boxes now have an OS installed and puppet and salt signed and running.