[00:47:29] any project working on analyzing wikipedia user behaviors?
[06:37:39] PROBLEM - Puppet failure on tools-login is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[07:07:42] RECOVERY - Puppet failure on tools-login is OK: OK: Less than 1.00% above the threshold [0.0]
[09:29:41] (PS1) Ricordisamoa: add .gitreview [labs/tools/wikicaptcha] - https://gerrit.wikimedia.org/r/180741
[09:41:34] Wikimedia-Labs-wikitech-interface: puppetize wikitech wiki configs - https://phabricator.wikimedia.org/T70535#932073 (Andrew) Open>Resolved Yep, this is done -- everything is in puppet and/or the wmf mediawiki config.
[10:13:08] andrewbogott: but I guess we’ll still have it done by next week :)
[10:13:23] That's what I thought last week :(
[10:14:10] andrewbogott: heh :)
[11:06:43] PROBLEM - Free space - all mounts on tools-webproxy is CRITICAL: CRITICAL: tools.tools-webproxy.diskspace._var.byte_percentfree.value (<11.11%)
[11:21:44] RECOVERY - Free space - all mounts on tools-webproxy is OK: OK: All targets OK
[11:31:27] PROBLEM - Puppet failure on tools-webgrid-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[11:32:07] PROBLEM - Puppet failure on tools-exec-gift is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[11:32:20] ok, who broke deployment-prep parsoid?
[11:32:59] PROBLEM - Puppet failure on tools-exec-13 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[11:33:29] PROBLEM - Puppet failure on tools-mail is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[11:34:39] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[11:34:39] PROBLEM - Puppet failure on tools-webgrid-05 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0]
[11:35:01] PROBLEM - Puppet failure on tools-exec-08 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[11:35:01] PROBLEM - Puppet failure on tools-exec-15 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[11:35:09] PROBLEM - Puppet failure on tools-webgrid-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[11:35:19] PROBLEM - Puppet failure on tools-webgrid-tomcat is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[11:35:49] better question: who broke everything? :)
[11:36:02] PROBLEM - Puppet failure on tools-webgrid-03 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[11:36:34] PROBLEM - Puppet failure on tools-trusty is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[11:36:52] PROBLEM - Puppet failure on tools-exec-12 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[11:37:18] PROBLEM - Puppet failure on tools-exec-07 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[11:37:46] PROBLEM - Puppet failure on tools-exec-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[11:37:50] PROBLEM - Puppet failure on tools-exec-catscan is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[11:37:54] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[11:38:37] PROBLEM - Puppet failure on tools-webgrid-04 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[11:38:41] PROBLEM - Puppet failure on tools-login is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[11:40:05] PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[11:41:06] PROBLEM - Puppet failure on tools-exec-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[11:41:30] PROBLEM - Puppet failure on tools-exec-04 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[11:42:18] PROBLEM - Puppet failure on tools-shadow is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[11:42:26] PROBLEM - Puppet failure on tools-webproxy is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[11:42:29] Krenair: looks like paravoid again :)
[11:42:50] PROBLEM - Puppet failure on tools-exec-05 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[11:44:40] PROBLEM - Puppet failure on tools-exec-cyberbot is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[11:44:54] PROBLEM - Puppet failure on tools-exec-09 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[11:45:18] (PS1) Faidon Liambotis: Stop installing Package['sudo-ldap'] [labs/private] - https://gerrit.wikimedia.org/r/180767
[11:45:28] PROBLEM - Puppet failure on tools-exec-10 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[11:45:42] (CR) Faidon Liambotis: [C: 2] Stop installing Package['sudo-ldap'] [labs/private] - https://gerrit.wikimedia.org/r/180767 (owner: Faidon Liambotis)
[11:46:10] (CR) Faidon Liambotis: [V: 2] Stop installing Package['sudo-ldap'] [labs/private] - https://gerrit.wikimedia.org/r/180767 (owner: Faidon Liambotis)
[11:46:10] PROBLEM - Puppet failure on tools-exec-06 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[11:46:12] PROBLEM - Puppet failure on tools-redis is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[11:46:16] YuviPanda: better?
[11:46:25] paravoid: looking
[11:46:32] well
[11:46:33] trying
[11:46:41] I think the repository is updating every minute
[11:46:45] so it might take 0-60s :)
[11:49:04] paravoid: heh, (File[/etc/sudo-ldap.conf] => Class[Ldap::Client::Sudo] => Class[Ldap::Client::Sudo] => File[/etc/sudo-ldap.conf])
[11:49:09] Error: Could not apply complete catalog: Found 1 dependency cycle:
[11:49:16] PROBLEM - Puppet failure on tools-exec-wmt is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[11:49:34] PROBLEM - Puppet failure on tools-master is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[11:49:56] PROBLEM - Puppet failure on tools-exec-11 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[11:50:16] PROBLEM - Puppet failure on tools-exec-03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[11:52:40] PROBLEM - Free space - all mounts on tools-webproxy is CRITICAL: CRITICAL: tools.tools-webproxy.diskspace._var.byte_percentfree.value (<11.11%)
[11:53:03] shut up shinken-wm
[11:55:24] (PS1) Faidon Liambotis: Remove Class['sudo'] -> User['root'] [labs/private] - https://gerrit.wikimedia.org/r/180768
[11:55:45] (CR) Faidon Liambotis: [C: 2 V: 2] Remove Class['sudo'] -> User['root'] [labs/private] - https://gerrit.wikimedia.org/r/180768 (owner: Faidon Liambotis)
[11:56:03] this must be it, although not 100% sure
[11:56:07] YuviPanda: ^
[11:56:24] the error message wasn't great, it didn't have base:: or passwords::root in it
[11:56:27] so I'm not sure
[11:58:24] paravoid: I’ll try in a minute...
[11:58:29] RECOVERY - Puppet failure on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0]
[11:58:40] I guess it's fixed
[11:58:51] sorry about that :)
[11:59:41] RECOVERY - Puppet failure on tools-webgrid-05 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:00:05] RECOVERY - Puppet failure on tools-exec-15 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:00:21] RECOVERY - Puppet failure on tools-webgrid-tomcat is OK: OK: Less than 1.00% above the threshold [0.0]
[12:01:27] RECOVERY - Puppet failure on tools-webgrid-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:01:35] RECOVERY - Puppet failure on tools-trusty is OK: OK: Less than 1.00% above the threshold [0.0]
[12:01:51] RECOVERY - Puppet failure on tools-exec-12 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:02:07] RECOVERY - Puppet failure on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0]
[12:02:21] RECOVERY - Puppet failure on tools-exec-07 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:02:55] RECOVERY - Puppet failure on tools-dev is OK: OK: Less than 1.00% above the threshold [0.0]
[12:02:57] RECOVERY - Puppet failure on tools-exec-13 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:03:41] RECOVERY - Puppet failure on tools-login is OK: OK: Less than 1.00% above the threshold [0.0]
[12:04:39] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:04:44] paravoid: :)
[12:04:57] RECOVERY - Puppet failure on tools-exec-08 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:05:09] RECOVERY - Puppet failure on tools-webgrid-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:05:15] YuviPanda, do you know why varnish in deployment-prep might return a 503 when trying to contact parsoid?
[12:05:33] Krenair: not sure at all...
[12:06:02] RECOVERY - Puppet failure on tools-webgrid-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:06:04] YuviPanda, what hosts should I be able to ssh to in deployment-prep, as a non-member?
[12:06:08] RECOVERY - Puppet failure on tools-exec-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:06:16] sorry, non-admin. I am a member
[12:06:17] Krenair: what do you mean as ‘non-member’?
[12:06:19] oh
[12:06:24] I said the wrong thing ;)
[12:06:28] RECOVERY - Puppet failure on tools-exec-04 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:06:29] most of ‘em, I suppose? I guess you can’t sudo..
[12:07:07] ok, they seemed to not be responding to ssh (but did to ping).
[12:07:18] RECOVERY - Puppet failure on tools-shadow is OK: OK: Less than 1.00% above the threshold [0.0]
[12:07:48] RECOVERY - Puppet failure on tools-exec-catscan is OK: OK: Less than 1.00% above the threshold [0.0]
[12:07:50] RECOVERY - Puppet failure on tools-exec-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:08:40] RECOVERY - Puppet failure on tools-webgrid-04 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:08:55] YuviPanda, ssh: connect to host deployment-parsoid04.eqiad.wmflabs port 22: Connection timed out
[12:09:05] from deployment-bastion
[12:09:39] RECOVERY - Puppet failure on tools-exec-cyberbot is OK: OK: Less than 1.00% above the threshold [0.0]
[12:10:05] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0]
[12:10:30] RECOVERY - Puppet failure on tools-exec-10 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:11:10] RECOVERY - Puppet failure on tools-redis is OK: OK: Less than 1.00% above the threshold [0.0]
[12:12:26] RECOVERY - Puppet failure on tools-webproxy is OK: OK: Less than 1.00% above the threshold [0.0]
[12:12:48] RECOVERY - Puppet failure on tools-exec-05 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:14:55] RECOVERY - Puppet failure on tools-exec-09 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:15:01] RECOVERY - Puppet failure on tools-exec-11 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:15:17] RECOVERY - Puppet failure on tools-exec-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:16:11] RECOVERY - Puppet failure on tools-exec-06 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:19:17] RECOVERY - Puppet failure on tools-exec-wmt is OK: OK: Less than 1.00% above the threshold [0.0]
[12:19:33] RECOVERY - Puppet failure on tools-master is OK: OK: Less than 1.00% above the threshold [0.0]
[12:23:35] Krenair: uh oh, I wonder if the host is dead
[12:23:59] well I can ping it YuviPanda
[12:24:17] 64 bytes from deployment-parsoid04.eqiad.wmflabs (10.68.16.17): icmp_req=37 ttl=64 time=0.368 ms
[12:24:19] Krenair: ok let me take a look
[12:24:43] Krenair: I can ssh in there.
[12:24:55] how strange
[12:25:11] Krenair: and I just restarted parsoid there
[12:25:15] any effects?
[12:25:55] varnish shows 503 on http://en.wikipedia.beta.wmflabs.org/w/index.php?title=0.921072816063511&veaction=edit
[12:28:51] (still)
[12:32:59] Krenair: I see no logs either.
[12:33:59] wat, logs are in /data/project/parsoid?!
[12:34:01] sgh
[12:57:42] RECOVERY - Free space - all mounts on tools-webproxy is OK: OK: All targets OK
[13:13:41] PROBLEM - Free space - all mounts on tools-webproxy is CRITICAL: CRITICAL: tools.tools-webproxy.diskspace._var.byte_percentfree.value (<33.33%)
[13:28:40] RECOVERY - Free space - all mounts on tools-webproxy is OK: OK: All targets OK
[13:44:43] PROBLEM - Free space - all mounts on tools-webproxy is CRITICAL: CRITICAL: tools.tools-webproxy.diskspace._var.byte_percentfree.value (<11.11%)
[14:39:09] Why hello there, sentient lifeforms.
[14:39:28] Message received.
[14:40:09] I need a bit of help: I don't believe my cronjob is running. :O
[14:40:59] It's set up just how it should be, but my ENWP bot isn't editing.
[14:41:37] Then when I run the script manually, it edits, so I know it has nothing to do with my script.
[14:42:14] are you using full paths in the cron and the script?
[14:43:08] .
[14:43:38] 20 * * * * /usr/bin/jsub -N cron-tools.cerabot-1 -once -quiet /usr/bin/python $HOME/wikitool-tasks/datetemplates.py
[14:44:06] That's the command as it's written in the cron file.
[14:44:45] then I have no idea, please wait for next agent
[14:45:01] (the Parsoid issue on beta cluster has been fixed, we followed up on #wikimedia-qa )
[14:45:02] lol.
[14:46:07] ceradon: replace $HOME with /data/project/?
[14:46:19] ceradon: I’m not sure if $HOME is actually set...
[14:47:57] It's set on the command line, of course. Is there a different set of variables set for cronjobs only?
[14:49:18] ceradon: cronjobs usually have not much env set.
[14:49:28] the tools ones *might* be different, but I highly doubt that
[14:53:08] Good morning Eloquence, Coren, andrewbogott_afk! In which channel or under which nicknames can I find the persons who are technically responsible for WMF's wordpress blog and could answer me an ldap question?
[14:54:04] Silke_WMDE: emailing Tilman Bayer might be of use. I believe the WMF blog is now hosted by Automattic?
[14:54:27] oh
[14:54:44] thanks YuviPanda - hehe - you know everything!
[14:54:59] Silke_WMDE: :D
[14:55:25] I'm having a fight with all those different ldap plugins for WP none of which is really working as expected... :(
[14:55:59] ah, damn.
[14:56:06] don’t think there’s anyone in ops who can help, tho
[14:56:32] ok
[14:57:30] YuviPanda: You are right: host blog.wikimedia.org -> blog.wikimedia.org is an alias for wikimediablog.wordpress.com.
[14:58:26] Silke_WMDE: yup, plus at the bottom of the blog you will find ‘powered by Wordpress.com VIP hosting'
[14:58:40] :D
[15:02:37] how does the authentication work on that blog?
[15:21:42] RECOVERY - Free space - all mounts on tools-webproxy is OK: OK: All targets OK
[15:59:57] Tool-Labs-tools-Commons-Delinker: CommonsDelinker: Proposal of deletion WITH REPLACEMENT - https://phabricator.wikimedia.org/T84882#932770 (Elisardojm) NEW
[16:06:23] Tool-Labs: data-* span classes in WSexport ePub produce errors - https://phabricator.wikimedia.org/T84884#932799 (Nemo_bis) p:Triage>Normal a:Tpt
[16:15:07] Tool-Labs-tools-Other: data-* span classes in WSexport ePub produce errors - https://phabricator.wikimedia.org/T84884#932823 (scfc)
[17:21:14] YuviPanda: help? :)
[17:21:17] limn1 is dead
[17:21:19] i rebooted it
[17:21:25] can't ssh to it
[17:21:28] can't get console output
[17:21:34] can't reboot it again
[17:21:39] just says "ERROR"
[17:23:22] milimetric: What is the instance and project name?
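(A note on the cronjob exchange above, 14:43-14:49: cron starts jobs with an almost empty environment, so a crontab line that relies on $HOME may not see it set. Below is a minimal Python sketch of the idea YuviPanda hints at, resolving the home directory inside the script rather than in the crontab; the directory names mirror the quoted crontab line but are illustrative assumptions only, not the tool's actual layout.)

```python
# Sketch of the point made at 14:49: cron provides almost no environment,
# so a script that needs the tool's home directory is safer working it out
# itself instead of trusting $HOME from the crontab. Paths are illustrative.
import os
import pwd


def home_dir():
    """Use $HOME when the environment provides it, else the passwd entry."""
    return os.environ.get("HOME") or pwd.getpwuid(os.getuid()).pw_dir


if __name__ == "__main__":
    base = home_dir()
    task = os.path.join(base, "wikitool-tasks", "datetemplates.py")
    # In the crontab itself the safest form is an absolute path with no
    # variables at all, e.g. under the tool's /data/project/<tool>/ directory
    # (the "<tool>" part is a placeholder here).
    print("cron would need the absolute path:", task)
```

(With that, the crontab entry only needs absolute paths to the interpreter and the script, and nothing depends on which variables cron happens to export.)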
[17:23:47] Coren: limn1, project analytics
[17:23:53] thanks for chiming in
[17:24:07] it's self-hosted puppetmaster style, and not trusty
[17:25:12] Ah. Out of memory on the host.
[17:25:15] * Coren sighs.
[17:25:34] We're this close -> <- to having the new hosts ready. That'll help.
[17:26:22] so in the meantime the instance won't come up?
[17:26:29] can I shut other instances down to help?
[17:26:41] I'm trying to see if there are instances on that virt host I can shut down.
[17:26:59] if any of them are in project analytics, I can help triage
[17:27:10] Coren: which host?
[17:28:50] 1002, andrewbogott. milimetric: yeah, there are several in analytics there. dan=pentaho, nuria-worker1, all the qchris-*, wikimetrics{1,-dev1,-staging1}
[17:29:06] Coren: Nuria-worker1 can get deleted
[17:29:11] Coren: Is listed as virt1008 for me.
[17:29:30] Coren: qchris-* is listed as virt1008 for me.
[17:29:30] nuria__: No need to delete, just shutting it down should suffice.
[17:29:31] Coren: but I doubt that is controibuting muchto memeory issues
[17:29:35] *memory
[17:29:59] qchris: Hmm. Maybe nova is confused. Lemme double check.
[17:30:09] I'll shut down qchris-* regardless
[17:30:28] Coren: I'm shutting down some of those manually
[17:30:32] qchris: Nevermind, you're correct -- nova doesn't obey the --host restriction when you specify a project.
[17:30:32] Coren: we can also shut down wikimetrics-dev1
[17:30:34] (*@#&#^
[17:30:50] I have to check them manually to be sure.
[17:31:14] nuria-worker1 is definitely 1002 though
[17:31:24] Coren: "nova list --all-tenants --host virt1002" should get you what you want...
[17:31:28] Unless you just did that and it didn't work
[17:32:35] andrewbogott: Interestingly enough, if you combine --os-tenant-name and --host, it seems to ignore the latter and gives you all the instances of the project anyways.
[17:32:57] hm, dumb!
[17:33:04] I just killed three instances on virt1002
[17:33:15] qchris-* is powered down
[17:34:02] Coren: I shut down a few of those other instances
[17:34:10] so how do we tell limn1 to come back up
[17:37:29] Coren: i deleted nuria-worker1 cause I no longer needed it, thank you
[17:40:24] milimetric: I can restart it for you.
[17:41:48] It should be booting up now.
[17:42:48] thanks Coren: all is good now
[17:43:10] as always, you guys rule :) thx much
[17:43:21] * milimetric will never restart another labs instance as long as he lives
[17:44:56] That's going to be less painful when we have more resources.
[17:51:55] Coren: super thanks for the prompt help!
[17:57:18] andrewbogott_afk: yayaya!
[19:25:39] PROBLEM - Puppet failure on tools-exec-cyberbot is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[19:50:40] RECOVERY - Puppet failure on tools-exec-cyberbot is OK: OK: Less than 1.00% above the threshold [0.0]
[20:00:50] Coren: can you update https://phabricator.wikimedia.org/T72076 with your findings re: DNS server?
[20:00:58] (whenever, not now )
[20:06:18] Wikimedia-Labs-Infrastructure: Internal DNS look-ups fail every once in a while - https://phabricator.wikimedia.org/T72076#933721 (coren) Those stats surprise me a little, as I would not have expected DNS queries to be that varied - if I had to venture a guess I would suspect reverse queries for hostnames of...
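(On the nova quirk discussed above at 17:29-17:32, where the CLI appears to drop the --host filter once a tenant is scoped: a hedged workaround is to list servers across all tenants with python-novaclient and filter on the hypervisor attribute client-side. The client constructor, the OS_* environment variables and the OS-EXT-SRV-ATTR:host attribute below are assumptions about that era's novaclient and an admin-credentialed API, not a verified recipe.)

```python
# Hedged sketch: list servers across all tenants and keep only those whose
# hypervisor host matches, instead of trusting "nova list --host" once a
# project/tenant is specified. Admin credentials are assumed, since the
# extended host attribute is only populated for admins.
import os

from novaclient import client  # python-novaclient is assumed to be installed

nova = client.Client(
    "2",
    os.environ["OS_USERNAME"],
    os.environ["OS_PASSWORD"],
    os.environ["OS_TENANT_NAME"],
    os.environ["OS_AUTH_URL"],
)

TARGET_HOST = "virt1002"  # the overloaded virt host from the discussion

for server in nova.servers.list(search_opts={"all_tenants": 1}):
    # OS-EXT-SRV-ATTR:host is the hypervisor the instance runs on.
    if getattr(server, "OS-EXT-SRV-ATTR:host", "") == TARGET_HOST:
        print(server.tenant_id, server.name)
```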
[20:06:26] Ima still give a brief comment to add moar data
[20:10:30] It's not immediately clear how to tell dnsmasq to point the instances at the "real" DNS server, and I very much dislike hardcoding resolv.conf though that may be the better solution in the long term. I need to get better data first; ima do a small script that gathers resolution metrics for a while and point it at both dnsmasq and the real server.
[20:21:34] (CR) Legoktm: [C: 2] Report IRC using Python and Yuvi's ircnotifier [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/180322 (owner: Merlijn van Deen)
[20:25:14] (Merged) jenkins-bot: Report IRC using Python and Yuvi's ircnotifier [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/180322 (owner: Merlijn van Deen)
[20:26:39] valhallasw`cloud: didn't work :/
[20:26:47] legoktm: :/
[20:26:49] oh wait
[20:26:50] nvm
[20:26:55] I did a `pull`
[20:27:02] :D
[20:27:21] !log tools.wikibugs legoktm: Deployed 3dc8fd7f3f8fdaacec5998913278179382b8594f Report IRC using Python and Yuvi's ircnotifier wb2-irc
[20:27:25] \o/
[20:27:25] Logged the message, Master
[20:27:30] majick!
[20:28:55] should labslogbot really edit as bot? :/
[20:29:48] (PS1) Legoktm: Send Spam-* to /dev/null [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/180881
[20:32:14] * valhallasw`cloud hits labs-morebots
[20:32:16] "VERSION ircbot.py by Joel Rosdahl "
[20:32:17] really?
[20:33:11] ah, yes, adminbot
[20:33:15] valhallasw`cloud: we should replace logging + SAL with logstash some day
[20:34:01] It would need a separate logstash index unless we want to lose SAL history after 31 days
[20:34:07] but otherwise +1
[20:34:30] s/logstash/elasticsearch/
[20:34:37] bd808: yeah, I guess we can put this in prod ES itself? :)
[20:35:25] YuviPanda: yup. anomie has a new feature in beta (or soon to be in beta) that does just that
[20:35:34] niiice
[20:35:42] clones logstash events and send them to another ES cluster
[20:36:14] which will power a new special page to describe api usage
[20:36:58] * valhallasw`cloud wonders why adminbot gerrit patches don't show up here
[20:45:41] (CR) Legoktm: [C: 2] Send Spam-* to /dev/null [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/180881 (owner: Legoktm)
[20:46:58] (Merged) jenkins-bot: Send Spam-* to /dev/null [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/180881 (owner: Legoktm)
[20:47:58] !log tools.wikibugs Updated channels.yaml to: 9ad8e090c8fb06d487b89255562483f08cf354e3 Send Spam-* to /dev/null
[20:48:00] Logged the message, Master
[20:52:27] mwclient is horrible :<
[20:52:39] probably about 50% as horrible as pywikibot
[20:52:42] ;-D
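(Coren's note at 20:10 about a small script gathering resolution metrics against both dnsmasq and the "real" DNS server could look roughly like the sketch below. It assumes the dnspython package is available; the resolver addresses and test hostnames are placeholders, not the actual labs values.)

```python
# Minimal sketch of a resolution-metrics gatherer: time the same A lookups
# against two resolvers and print how long each took (or what failed), so
# slow or flaky answers show up in the output. All addresses are placeholders.
import time

import dns.resolver  # provided by the dnspython package (an assumption here)

RESOLVERS = {
    "dnsmasq": "10.0.0.1",   # placeholder for the dnsmasq instance
    "real-dns": "10.0.0.2",  # placeholder for the "real" DNS server
}

NAMES = ["tools-login.eqiad.wmflabs", "deployment-bastion.eqiad.wmflabs"]


def timed_lookup(server, name):
    """Return (seconds, error) for one A lookup against one resolver."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]
    resolver.lifetime = 5.0  # give up after 5s so hangs show up as errors
    start = time.time()
    try:
        resolver.query(name, "A")
        return time.time() - start, None
    except Exception as exc:  # timeouts, NXDOMAIN, SERVFAIL, ...
        return time.time() - start, exc


if __name__ == "__main__":
    for label, server in sorted(RESOLVERS.items()):
        for name in NAMES:
            elapsed, err = timed_lookup(server, name)
            print("{0:10s} {1:40s} {2:7.1f} ms {3}".format(
                label, name, elapsed * 1000, err or "ok"))
```

(Run in a loop from cron or a shell while-loop, the output gives a rough distribution of lookup latency and failure rate for each resolver, which is the comparison Coren describes wanting before deciding whether to hardcode resolv.conf.)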