[00:50:31] jdlrobson: password changed, you should always use "mysql --defaults-file=/etc/mysql/conf.d/research-client.cnf" instead of copying the password
[00:50:41] jdlrobson: that file is updated anytime the password changes
[00:50:49] (it was shared publicly, that's why it was changed)
[09:32:36] Analytics-Kanban, User-Elukey: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T168303#3455713 (elukey) If my calculations are correct, the top 20 tables of the log database on store/slave weigh ~2.6TB, so even cutting their size in half would be a massive win for us. Just sent an email...
[12:49:58] so, something interesting discovered today: the eventbus hosts are constantly making a ton of DNS requests for statsd.eqiad.wmnet
[12:50:01] https://github.com/wikimedia/eventlogging/blob/be89e0fd08e0d175364c3fe6e0f288f231724576/eventlogging/handlers.py#L569
[12:50:32] we added the IP to /etc/hosts on kafka2002, depooled acamar again and it didn't fail the Pybal checks
[12:59:16] huh.
[12:59:25] ottomata: o/
[12:59:54] we just tried to apply the same fix to all kafka2*, no outage when acamar was depooled
[12:59:57] you sure that's the code where the request comes from though?
[13:00:18] it is the only one that resolves statsd afaics
[13:00:20] no?
[13:00:48] statsd_writer should only be used if it is configured as an eventlogging output
[13:01:03] and maybe that hostname is not statsd, mmmmm
[13:01:03] also
[13:01:05] weird
[13:01:09] no, that should be
[13:01:10] but
[13:01:16] that line will only be executed once
[13:01:21] that is a coroutine
[13:01:33] only the stuff in the while 1: part would be looped
[13:01:50] but, even so, eventbus doesn't use statsd_writer, i think we don't use that anywhere anymore
[13:02:16] but
[13:02:29] service.py uses a statsd tornado plugin to send request stats
[13:02:44] https://github.com/sprockets/sprockets.mixins.statsd
[13:04:34] Note that the socket for
[13:04:35] communicating with statsd is created once upon module import and will not
[13:04:37] change until the application is restarted or the module is reloaded
[13:05:17] maybe there is something that prevents that IP from being cached
[13:05:20] as it is stated
[13:05:29] anything else interested in statsd, ottomata?
[13:07:55] on eventbus hosts? I don't think so
[13:08:21] so if the statsd writer is not used (and apparently it is efficient) it might be sprockets.mixins.statsd?
[13:08:52] iirc that's the only thing that would talk to statsd, unless node-rdkafka does?
[13:08:57] looking
[13:08:59] sorry
[13:08:59] haa
[13:09:02] not node-rdkafka
[13:09:05] kafka-python
[13:10:16] yeah, pretty sure it's just the sprockets mixin
[13:10:25] oh
[13:10:31] jmxtrans
[13:10:31] ?
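A minimal sketch of the writer-coroutine pattern described at 13:01 above (simplified names, not the actual eventlogging code; the hostname and metric are only examples): everything before the loop runs exactly once, when the coroutine is primed, and only the body of the while 1: loop runs per stat. Resolving the hostname in the one-time setup is what keeps the hot loop free of DNS lookups.

    import socket

    def statsd_writer_sketch(hostname, port):
        # Setup: runs only once, when the coroutine is primed.
        addr = socket.gethostbyname(hostname), port
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        while 1:
            # Loop body: runs once per stat sent into the coroutine.
            stat = yield
            sock.sendto(stat.encode('utf-8'), addr)

    writer = statsd_writer_sketch('statsd.eqiad.wmnet', 8125)
    next(writer)                          # prime: executes the setup lines once
    writer.send('eventbus.requests:1|c')  # each send() re-runs only the loop body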
[13:10:36] but that's not eventlogging eventbus
[13:10:49] since the kafka brokers run there too
[13:11:18] jmxtrans runs there and sends jmx metrics
[13:11:19] to statsd
[13:11:29] but that's not the eventlogging process
[13:12:11] let me check on kafka1001 the process that does the queries, didn't check before
[13:13:12] not easy to get though
[13:16:58] so I used netstat -aup
[13:17:15] and I can see a ton of things for jmxtrans
[13:17:31] let me stop jmxtrans on kafka1001 to double check
[13:18:01] but it wouldn't make any sense
[13:19:20] ok in fact it is not that one
[13:19:29] :D
[13:20:12] nono it must be eventbus
[13:21:23] plus in graphite I can see "service" populated for eventbus
[13:21:40] kafka goes through jmxtrans, that is fine
[13:22:54] ya then it must be the sprockets mixin
[13:31:07] ottomata: could it be possible that the code related to sprockets.mixins is executed multiple times rather than only once
[13:31:10] ?
[13:32:03] I am thinking out loud, this thing doesn't make much sense :D
[13:35:28] The default statsd server that is used is ``localhost:8125``. The
[13:35:28] ``STATSD_HOST`` and ``STATSD_PORT`` environment variables can be used to
[13:35:31] set the statsd server connection parameters.
[13:35:40] but I can't find any trace of them in the plugin code
[13:37:37] looking
[13:37:42] i don't know much about how the mixin thing works
[13:37:43] so it is possible
[13:40:57] might be in sprockets.clients.statsd, which is a dependency
[13:42:07] https://github.com/sprockets/sprockets.clients.statsd/blob/master/sprockets/clients/statsd/__init__.py
[13:42:22] yeah
[13:42:25] https://github.com/sprockets/sprockets.clients.statsd/blob/34daf6972ebdc5ed1e8fde2ff85b3443b9c04d2c/sprockets/clients/statsd/__init__.py
[13:42:25] oh haha
[13:42:27] we found the same :)
[13:43:10] hmmmm yeah elukey, it looks like it makes the socket once, but without the addr
[13:43:21] the STATSD_ADDR is provided in the sendto call
[13:43:25] maybe that is doing a lookup
[13:43:41] yeah I wanted to say the same
[13:43:51] maybe it is so stupid that it needs a raw IP
[13:44:56] yeah
[13:45:03] connection_string = os.getenv('STATSD')
[13:45:03] if connection_string:
[13:45:03]     url = urlparse.urlparse(connection_string)
[13:45:04]     STATSD_ADDR = (url.hostname, url.port)
[13:45:11] elukey: i betcha if we did the lookup ourselves
[13:46:15] looking for where eventlogging sets STATSD or STATSD_HOST
[13:46:48] ah, it's in systemd
[13:46:49] env var
[13:46:54] Environment="STATSD_HOST=statsd.eqiad.wmnet"
[13:47:07] ya elukey, i betcha if we made eventlogging override that env var with an ip
[13:47:11] doing the host lookup ourselves
[13:47:18] socket.sendto will stop doing that
[13:47:19] but this is crazy!
[13:47:20] for every metric
[13:47:34] sigh
[13:47:35] elukey: i think it's a udp socket python thing
[13:47:54] if socket.sendto is given a hostname
[13:47:55] I'd expect the plugin to do the dns resolution
[13:47:59] it's gotta do a dns lookup, ya?
[13:48:02] if it finds that it is not an IP
[13:48:06] oh yes
[13:48:11] or, make a socket with an endpoint once
[13:48:47] else:
[13:48:47]     STATSD_ADDR = (os.getenv('STATSD_HOST', 'localhost'),
[13:48:48]                    int(os.getenv('STATSD_PORT', 8125)))
[13:48:48] oh elukey
[13:48:54] this one is flawed
[13:49:02] that's exactly what the eventlogging statsd_writer does
[13:49:05] addr = socket.gethostbyname(hostname), port
[13:49:06] then
[13:49:08] in the loop
[13:49:11] exactly
[13:49:12] sock.sendto(stat.encode('utf-8'), addr)
[13:49:17] but with the IP
[13:49:20] ya
[13:49:24] ok we found it
[13:49:38] elukey: i can make a quick patch to el for ya
[13:50:15] please do it mr ottomata
[13:50:42] (PS2) Milimetric: Implement Wikistats metrics as Druid queries [analytics/refinery] - https://gerrit.wikimedia.org/r/365806 (https://phabricator.wikimedia.org/T170882)
[13:51:15] ottomata: task is https://phabricator.wikimedia.org/T171048
[13:51:17] going to update it
[13:57:49] elukey: you have a way to test/repro this in prod?
[13:57:58] cause I can apply it manually on a codfw host and we can double check if my fix works
[13:58:28] Analytics, EventBus, Operations, User-Elukey: Eventbus does not handle gracefully changes in DNS recursors - https://phabricator.wikimedia.org/T171048#3456304 (elukey) Me and @ema set up an experiment, namely adding `10.64.32.155 statsd.eqiad.wmnet` in /etc/hosts on kafka2002 (and not on the othe...
[13:59:05] ottomata: sure, just tcpdump and see no statsd.e.w related traffic :D
[13:59:19] ema applied a hack to /etc/hosts
[13:59:29] so make sure that there is nothing in there
[13:59:38] elukey: https://gerrit.wikimedia.org/r/#/c/366561/
[14:00:24] yeah it looks good
[14:00:43] but then does it mean restarting the service if we change statsd.eqiad.wmnet's IP in DNS?
[14:00:53] ya it would
[14:00:55] yes..
[14:01:00] hello ema :)
[14:01:02] o/
[14:01:04] :)
[14:01:50] ottomata: mmmm one thing - is the "localhost" default necessary?
[14:02:04] no, but it is what the lib defaults to
[14:02:20] hmm, the service supports some sighup reloading
[14:02:30] yeah but I'd leave it confined in the lib.. we might just say
[14:02:31] ahhh but it doesn't matter
[14:02:35] if os[]: then ..
[14:02:37] because the lib caches the addr
[14:02:54] i suppose sendto to localhost isn't going to do remote dns
[14:03:00] ok
[14:03:04] but it will fail brutally :D
[14:03:36] also, say we put an IP for STATSD_HOST
[14:03:55] will gethostbyname be happy?
[14:04:05] checking
[14:04:22] i checked that
[14:04:25] it just returns the ip
[14:04:45] super
[14:05:08] ok, elukey i can apply this to a codfw box in prod
[14:05:09] buuuut
[14:05:21] can we verify that it actually solves the problem?
[14:05:51] ottomata: sure, it is sufficient to depool acamar
[14:06:05] ok cool, applying on kafka2001
[14:07:14] removed ema's statsd override from /etc/hosts
[14:07:35] so I was thinking of something similar for T151643, but perhaps caching the result and resolving it every now and then would be better than doing it one-off?
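A hedged side-by-side of the two patterns quoted above (names simplified; see sprockets.clients.statsd for the real code). The library keeps the hostname in STATSD_ADDR, so every sendto() makes the resolver look it up again, one DNS query per metric; eventlogging's statsd_writer instead resolves once and reuses the raw IP.

    import os
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    # Flawed (what sprockets.clients.statsd does): the tuple keeps the
    # hostname, so each sendto() triggers a fresh DNS lookup.
    STATSD_ADDR = (os.getenv('STATSD_HOST', 'localhost'),
                   int(os.getenv('STATSD_PORT', 8125)))
    sock.sendto(b'test.metric:1|c', STATSD_ADDR)

    # eventlogging's statsd_writer approach: resolve once up front, then
    # reuse the raw IP for every packet.
    addr = (socket.gethostbyname(os.getenv('STATSD_HOST', 'localhost')),
            int(os.getenv('STATSD_PORT', 8125)))
    sock.sendto(b'test.metric:1|c', addr)

The trade-off, raised by fgiunchedi at 15:21 below, is that resolving once means a restart (or reload) is needed if the statsd name ever points at a new IP; ema's idea at 14:07 of caching with a periodic re-resolve is the middle ground.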
[14:07:39] T151643: python-varnishapi daemons seeing "Log overrun" constantly - https://phabricator.wikimedia.org/T151643
[14:08:15] also see T104442
[14:08:15] T104442: Investigate better DNS cache/lookup solutions - https://phabricator.wikimedia.org/T104442
[14:08:17] ema: yeah i thought about putting it in a sighup handler that is already there
[14:08:17] but
[14:08:23] the lib caches the value of the env var itself
[14:08:32] yeah
[14:08:42] https://github.com/sprockets/sprockets.clients.statsd/blob/34daf6972ebdc5ed1e8fde2ff85b3443b9c04d2c/sprockets/clients/statsd/__init__.py#L44-L52
[14:08:49] although, haha, it is global
[14:08:51] so we could keep hacking
[14:08:51] :p
[14:08:52] :/
[14:08:55] nope :D
[14:09:13] I think that this is a good short/medium term solution
[14:09:15] agree
[14:09:33] maybe we can make a pull request on the lib and it will get merged
[14:09:35] who knows
[14:09:54] I am doing it, this is a bug in their code
[14:10:27] let's see first if we fix the issue
[14:10:29] cool thanks
[14:10:33] ok restarting eventbus on kafka2001
[14:11:13] tcpdump with host statsd.eqiad.wmnet still shows stuff, buuuut, i betcha tcpdump is smart
[14:11:22] of course it is :)
[14:11:28] elukey: depool acamar?
[14:12:36] elukey: when we are done here, let's do ops sync :)
[14:14:16] so I don't see anything more with sudo tcpdump -vvv -l -n dst port 5
[14:14:19] that is super good :)
[14:14:53] ottomata: since it might cause a little outage, can we apply the fix to kafka200[23] before depooling acamar?
[14:15:06] sure, few mins
[14:15:07] maybe to 2/3 hosts
[14:15:12] super
[14:18:38] elukey: 2/3 hosts so we can see the 3rd fail? :)
[14:20:27] yep!
[14:21:29] ok done elukey
[14:21:33] applied on 2001 and 2002
[14:21:39] 2003 left running as is
[14:22:13] ema: do you have a minute to depool acamar?
[14:23:23] (removed entries in /etc/hosts on kafka200[23])
[14:28:27] elukey: sure
[14:28:39] <3
[14:31:11] elukey: you keep an eye on lvs2003?
[14:31:35] yep
[14:32:29] elukey: I'm seeing the timeouts on kafka2003 (resolving cp1008)
[14:32:32] Jul 20 14:32:23 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Monitoring instance ProxyFetch reports server kafka2003.codfw.wmnet (enabled/up/pooled) down: Getting http://localhost/v1/topics took longer than 5 seconds.
[14:32:42] that's the only one without the fix \o/
[14:32:49] ottomata: --^
[14:32:49] great
[14:33:04] ema: feel free to repool
[14:33:24] ok repooling
[14:34:14] ok cool
[14:34:26] i'm going to do some EL deploys today anyway
[14:34:28] will do ones for eventbus too
[14:34:30] maybe that first
[14:34:42] elukey: merge my change?
[14:35:06] ottomata: +2
[14:35:42] ottomata: the tests failing are expected right? Still need to fix them?
[14:35:46] cool
[14:35:52] elukey: let's do ops sync while we still have time :)
[14:36:00] (thanks ema!)
[14:36:12] am in bc
[14:36:17] pleasure!
[14:36:35] ottomata: let me go outside the co-working space, a lot of noise, 1 min
[14:37:20] k
[14:37:29] mforns: can you log into deployment-eventlogging03 ?
[14:37:42] ottomata, trying
[14:38:00] I used to
[14:38:08] i wonder if its disk is full again or something
[14:38:11] ottomata, no, I can not
[14:38:20] aye, i can log into other deployment-prep instances fine
[14:38:25] Permission denied (publickey).
[14:38:27] maaan, i want to wipe that one and set up a new beta el
[14:38:45] mforns: objections to a shorter name?
[14:38:49] deployment-el01 maybe?
[14:38:49] no
[14:38:56] or
[14:38:59] maybe we keep it descriptive
[14:39:02] deployment-eventlog01
[14:39:05] like in prod sorta
[14:39:10] the no was to: no objection
[14:39:15] k cool
[14:39:48] ottomata, yes, deployment-eventlog01 is good imo
[14:39:59] k
[14:51:56] Analytics, Scoring-platform-team-Backlog, revscoring, artificial-intelligence: [Investigate] Hadoop integration for ORES training - https://phabricator.wikimedia.org/T170650#3438153 (Halfak)
[15:00:38] ping ottomata
[15:01:09] ping elukey
[15:21:32] Analytics, EventBus, Operations, Patch-For-Review, User-Elukey: Eventbus does not handle gracefully changes in DNS recursors - https://phabricator.wikimedia.org/T171048#3452124 (fgiunchedi) FYI caching statsd name forever also has its problems when failing over statsd as many services need re...
[15:31:58] hello
[15:32:26] on beta cluster I have removed a bunch of role::eventlogging::analytics::* classes which were applied project wide ( via https://horizon.wikimedia.org/project/puppet/ )
[15:32:33] that broke puppet on most instances
[15:32:44] ottomata: for after your meeting :D
[15:32:59] the class should be applied per prefix or per instance :}
[15:45:03] oh sorry
[15:45:05] just saw these
[15:45:08] responded in mwsec too
[15:45:13] they were applied project wide?!!
[15:45:22] OH MY GOODNESS
[15:45:23] sorry hashar
[15:45:28] that's my fault
[15:45:28] i just applied those
[15:45:32] but i meant to apply it to JUST MY NEW NODE
[15:45:35] i was on the wrong tab
[15:46:12] thank you hashar that was the right thing to do
[15:48:58] ping elukey
[15:59:38] Analytics, EventBus, Wikimedia-Stream: Add tags to recentchange stream - https://phabricator.wikimedia.org/T171182#3456794 (Nirmos)
[16:00:10] ottomata: some classes were applied project wide
[16:00:13] ottomata: :-}
[16:00:17] no worries!
[16:15:38] Analytics, Discovery, Discovery-Analysis: Public folder on stat1005 for Discovery's A/B test reports - https://phabricator.wikimedia.org/T171187#3456874 (mpopov)
[16:35:05] Analytics, Discovery, Discovery-Analysis: Public folder on stat1005 for Discovery's A/B test reports - https://phabricator.wikimedia.org/T171187#3456965 (Ottomata) Hm, would you mind if the report url was just analytics.wikimedia.org/datasets/discovery/reports/ ? If not, you could put stuff in publi...
[16:48:52] * elukey off!
[16:55:52] nuria_: hash ar had me make a new instance, and now i can log in
[16:56:06] going to see if a reboot of the old one works
[16:56:20] ottomata: k, so we scrap the old one, we need to update docs (i will do that) - what is the new one called?
[16:56:20] then i don't have to bother with making a new one (today)
[16:56:30] lemme try fixing the old one, it sounds like a reboot might fix it
[16:56:34] ottomata: k
[16:56:57] elukey, mforns: i think we can entirely delete this data too: https://meta.wikimedia.org/wiki/Schema_talk:PageCreation
[16:57:04] as it is in the data lake
[16:57:12] have sent aaron a note about it
[16:57:15] aha
[16:58:00] nuria_, the only difference is the availability, no?
data lake is updated monthly
[16:58:16] mforns: no, we also have page-creation from mw now
[16:58:25] mforns: which is a much better version of that schema
[16:59:46] mforns: see: https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema/mediawiki/revision/create
[17:00:43] Analytics, Discovery, Discovery-Analysis: Public folder on stat1005 for Discovery's A/B test reports - https://phabricator.wikimedia.org/T171187#3457066 (mpopov) Open→Resolved a: mpopov We're okay with that :)
[17:00:47] nuria_, of course!
[17:06:30] nuria_: oh, what did we decide? i should just stop tranquility, right?
[17:11:43] ottomata: yes please
[17:11:49] ottomata: webrequest is updated hourly
[17:11:53] ottomata: and daily
[17:13:27] ok
[17:14:52] !log killed tranquility instances tranq-banners and tranq-netflow running on druid1003 in joal's screen sessions
[17:14:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:24:50] hey elukey fyi
[17:24:56] i see in my.cnf for db1046
[17:25:03] character_set_server = binary
[17:25:04] character_set_filesystem = binary
[17:25:04] collation_server = binary
[17:32:38] nuria_: do you know how to enable tokudb?
[17:32:40] i get
[17:32:48] Can't open shared library '/usr/lib/mysql/plugin/ha_tokudb.so' (errno: 17, cannot open shared object file: No such file or directory)
[17:32:51] when I configure it in my.cnf
[17:35:44] ottomata: yessir
[17:35:54] ottomata: it is on wikitech, let me see
[17:36:21] ottomata: /etc/init.d/mysql start --default-storage-engine=tokudb
[17:36:32] https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/TestingOnBetaCluster
[17:36:55] ottomata: let me try to restart mysql on beta
[17:37:17] naw nuria_, something is wrong
[17:37:19] Can't open shared library '/usr/lib/mysql/plugin/ha_tokudb.so'
[17:37:26] i've got it configured as default storage
[17:37:32] ottomata: new machine or old machine?
[17:37:35] new machine
[17:37:39] old machine is locked
[17:37:40] can't get in
[17:37:44] ottomata: yes
[17:37:57] ottomata: what is the new machine's name?
[17:38:25] deployment-eventlog02
[17:39:31] i'm going to proceed for now without toku
[17:40:08] ottomata: i think insertions will not work
[17:40:15] ? they'll work
[17:40:16] sure
[17:40:18] they work in vagrant
[17:40:21] no toku there
[17:40:46] ottomata: ok, the el code specifies that engine but maybe it doesn't fail, ok
[17:41:16] it falls back to innodb
[17:41:46] k
[17:41:47] pretty sure
[17:42:39] ottomata: does this machine have a different os?
[17:42:53] no
[17:42:54] trusty
[17:43:11] aham
[17:43:13] what, innodb not supported?!
[17:44:00] wait no
[17:44:03] it's ok, i had it misconfigured
[17:44:04] cool.
[17:44:42] ottomata: k, you installed mysql w/o puppet right?
[17:45:41] ya
[17:45:49] mariadb-server-5.5
[17:46:48] ha, sigh, scap never works
[17:46:52] when i want it to :)
[17:48:58] ottomata: it never works on beta labs
[17:50:02] ottomata: k, let me know when you are done testing and i will try to install tokudb, i think i did this on the older one
[17:52:46] wow nuria_ no wonder the old one kept filling up
[17:52:54] ottomata: yes?
[17:53:16] lots of events
[17:53:20] all those mobilewikiapp things
[17:53:28] ottomata: ya, especially page create, they are driven by tests
[17:54:31] ottomata: what could we do here? disable the mysql consumer so they only go to files?
[17:54:54] it goes to files too
[17:55:10] ottomata: turn off el and only turn it on at will?
[17:56:50] sample?
[17:56:56] naw that would be annoying
[17:57:03] have them not send so many events?
[17:59:34] ottomata: code is the same
[18:00:24] ottomata: so they are sending at an equal rate to prod events, more so, as tests execute some code over and over
[18:00:50] aye yikes
[18:00:59] ottomata: we just need to run the eventlogging cleaner script with less than 1 week
[18:01:14] ottomata: and delete anything that is older
[18:01:18] ottomata: right?
[18:01:47] ottomata: files will be moved to archive and maybe we can also delete those on a more frequent schedule
[18:02:21] ottomata: does running the el cleaner script seem like a sane solution?
[18:02:48] ottomata: could we set it up in puppet so it runs on beta with a smaller frequency? even in a cron
[18:04:07] ya
[18:05:19] ottomata: ok, let's do that then
[18:05:26] ottomata: should i file a ticket?
[18:05:45] ya sure
[18:07:21] Analytics, Analytics-EventLogging: Run eventlogging purging script on beta labs to avoid disk getting full - https://phabricator.wikimedia.org/T171203#3457322 (Nuria)
[18:07:25] ottomata: k, done
[18:10:35] yagh, so i'm trying to test the whole pipeline, and the mirror maker from main to analytics kafka clusters doesn't seem to be working in beta right now
[18:10:36] yarrr
[18:13:20] ottomata: ARGH
[18:13:32] ottomata: that was working as of me testing page-create fixes last week
[18:13:43] oh really?
[18:13:45] that's cool
[18:13:49] hm
[18:13:49] then whyyyy
[18:13:50] ottomata: as in
[18:13:53] eventbus to main kafka is working
[18:13:56] ottomata: i saw events flowing in np
[18:14:01] they're just not going to the kafka cluster the eventlogging subscribes to
[18:14:53] AH https://deployment.wikimedia.beta.wmflabs.org/wiki/Test03
[18:14:55] ah
[18:14:56] wait i think it's going now
[18:15:04] ok, mirror maker needed a bump on both servers
[18:17:19] AHHH
[18:17:23] ottomata: k, good, cause otherwise .. man
[18:17:24] i need to delete the other el box
[18:17:29] nuria_: i'm doing that, s'ok?
[18:17:32] ottomata: yes please, i will update docs
[18:17:34] it's actually running, and consumers there are balancing
[18:17:40] taking over my new ones :)
[18:18:07] !log deleted instance deployment-eventlogging03 in favor of new instance deployment-eventlog02
[18:18:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:18:18] ottomata: k, i will also try to install tokudb when you are done with testing
[18:20:25] nuria_: ungh, i don't know what is happening, now it's not working
[18:20:29] eventbus says it's working fine
[18:20:38] but i can't consume the data from kafka, even from the main kafka cluster
[18:20:41] ottomata: but events are not consumed on the EL box?
[18:20:43] i can see eventbus getting my events fine
[18:20:53] no i can't even consume in kafka
[18:20:56] it was working
[18:20:57] it just stopped
[18:21:06] ottomata: and kafkacat on the new box can't connect either?
[18:21:18] ottomata: argh
[18:21:21] it can connect
[18:21:24] just doesn't see any messages
[18:21:28] they've got to be there though
[18:21:59] ottomata: have they been consumed by the 'other' box?
[18:22:12] doesn't work like that :)
[18:22:14] ottomata: no, that is not
[18:22:17] ottomata: ya
[18:22:29] i was like -> retardation
[18:23:57] man
[18:23:57] ERROR [Replica Manager on Broker 4]: Error processing fetch operation on partition [eqiad.mediawiki.page-create,0] offset 6140 (kafka.server.ReplicaManager)
[18:23:57] java.lang.IllegalArgumentException: Attempt to read with a maximum offset (6139) less than the start offset (6140).
[18:24:02] labs boxes just can't sit
[18:24:04] they get so rusty so fast
[18:24:29] ottomata: this is SO painful
[18:24:42] gonna try to wipe the main kafka cluster
[18:27:07] oh shoot, change prop uses this cluster too
[18:28:01] ottomata: but in beta
[18:28:10] ottomata: it should be wipeable
[18:28:54] yaa
[18:28:59] just not sure if they will need to bump them
[18:33:14] Analytics, Android-app-feature-Compilations-v1, Reading-Infrastructure-Team-Backlog: Track zim file downloads - https://phabricator.wikimedia.org/T171117#3454614 (Mholloway) Edited slightly for clarity. The compilations endpoint we're building is really technically something more like a compilation...
[18:39:51] Analytics, Android-app-feature-Compilations-v1, Reading-Infrastructure-Team-Backlog: Track zim file downloads - https://phabricator.wikimedia.org/T171117#3454614 (Nuria) >Analytics will probably have some ideas on how to count files served from wherever that ends up happening. For reader data such as...
[18:40:43] FINALLY
[18:40:44] did it nuria
[18:40:45] haha
[18:40:58] i was going crazy because i was producing revision create events but not seeing them in mysql
[18:41:03] but forgot that we don't produce them!
[18:41:04] :p
[18:41:12] ok!
[18:41:16] index creation tested!
[18:41:17] it works!
[18:43:28] what a day ottomata
[18:43:41] sorry, don't write them to mysql
[18:43:58] ottomata: SUPER thanks, on my end i will update docs to point to mysql, sorry about my senior moments with this
[18:44:13] haha, i had lots too
[18:44:18] oh no, there were a lot of things wrong
[18:44:23] wiping the kafka cluster helped :)
[18:44:27] ottomata: unrelated question:
[18:44:42] the dumps cluster is not fronted by varnish right?
[18:45:17] dumps.wm.org? pretty sure it's not a 'cluster' :)
[18:45:22] but yes, i think it is
[18:45:26] not certain
[18:45:27] but i think so
[18:45:33] why?
[18:46:13] cause someone was asking
[18:46:23] Analytics, Android-app-feature-Compilations-v1, Reading-Infrastructure-Team-Backlog: Track zim file downloads - https://phabricator.wikimedia.org/T171117#3457425 (Mholloway) I'm not sure what the caching situation will be. These files can be up to 10+ GB in size (I think we're saying up to as large...
[18:46:25] ottomata: i have to say i have not seen those requests in webrequest
[18:47:33] ottomata: ya, not there
[18:47:53] hmm ya maybe
[18:47:59] nuria_: we do copy the apache access logs to stat1005
[18:48:06] ottomata: so it is probably not fronted by varnish
[18:48:52] HMMMM
[18:48:55] or we should be...
[18:48:56] hmmm
[18:49:02] it's on stat1002
[18:49:04] will look into that
[18:49:34] ottomata: merging changeset for EL
[18:49:50] ottomata: ah you just did it
[18:49:56] :D
[18:49:57] ottomata: want me to deploy it to prod?
[18:50:10] ottomata: or are you doing that too?
[18:50:13] done it
[18:50:15] nuria_: hm.
[18:50:18] so here's a thing
[18:50:25] the eventlogging mysql-eventbus consumer
[18:50:28] is going to need a restart
[18:50:30] if a new schema is added
[18:50:31] hmmm
[18:50:34] whatatata
[18:50:36] i'll have to add a sighup handler
[18:50:42] like eventlogging service uses
[18:50:42] ottomata: why?
[18:50:48] it loads the schemas from disk during init
[18:50:58] ottomata: but wait, EL doesn't need a restart
[18:51:04] ottomata: if a schema changes
[18:51:12] that's because it looks it up on meta.wm.org
[18:51:30] ottomata: ah sorry, due to schema retrieval, not insertion
[18:51:33] ya
[18:51:40] although, i don't know why the consumer needs to look at the schema
[18:51:42] oh duh
[18:51:49] for mysql table creation
[18:51:49] right.
[18:51:53] ottomata: right
[18:51:59] we can add a sighup handler just like service
[18:52:06] then puppet will reload the schemas without restarting the process
[18:52:21] ottomata: but wait, how would we coordinate those two?
[18:52:28] ?
[18:52:40] nuria_: puppet pulls down the new schema commit to the repo
[18:52:41] and if it does
[18:52:45] it notifies the consumer process
[18:52:49] which does a sighup
[18:52:49] ottomata: ah ok
[18:52:59] then the el code just reloads all schemas from disk
[18:53:06] ottomata: upon restarting
[18:53:12] yes restarting
[18:53:12] but
[18:53:16] sighup does not need a restart
[18:53:22] reload instead of restart
[18:53:28] it just tells the process to reload the schemas into memory
[18:53:40] ottomata: reload, sorry, yes
[18:56:30] ottomata: do we have an example of this code elsewhere? does eventbus do the same to verify events?
[18:57:24] ja
[18:58:28] nuria_: https://github.com/wikimedia/eventlogging/blob/master/bin/eventlogging-service#L85
[18:58:29] also
[18:58:37] https://gerrit.wikimedia.org/r/#/c/366614/
[18:58:38] :)
[18:59:58] ottomata: you know i am remembering this now, we have talked about it before
[19:00:22] ottomata: and the puppet that goes with https://gerrit.wikimedia.org/r/#/c/366614/?
[19:00:59] for the el service it's different, since it is configured to run with systemd
[19:01:04] gotta figure out how to do it with upstart...
[19:01:05] buuuut
[19:01:24] https://github.com/wikimedia/puppet/blob/production/modules/eventlogging/manifests/service/service.pp#L156-L166
[19:01:51] and
[19:01:51] https://github.com/wikimedia/puppet/blob/production/modules/role/manifests/eventbus/eventbus.pp#L98
[19:03:12] Analytics-Kanban, WMF-Legal, Services (watching): License for pageview data - https://phabricator.wikimedia.org/T170602#3457522 (mforns) @GWicke > Licenses differ between end points, so changing the global disclaimer to CC0 doesn't seem accurate. This is why I suggested to add the CC0 license speci...
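A minimal sketch of the SIGHUP-reload approach linked above (assumed path and simplified logic, not the actual eventlogging-service code): puppet sends SIGHUP after pulling a new schema commit, and the handler re-reads the schemas from disk without restarting the process.

    import glob
    import json
    import signal

    SCHEMA_DIR = '/path/to/schema/repo'  # assumed: wherever puppet checks out schemas
    SCHEMAS = {}

    def load_local_schemas():
        # (Re)read every JSON schema under SCHEMA_DIR into memory.
        SCHEMAS.clear()
        for path in glob.glob(SCHEMA_DIR + '/**/*.json', recursive=True):
            with open(path) as f:
                SCHEMAS[path] = json.load(f)

    def handle_sighup(signum, frame):
        # puppet notifies the process after pulling a new schema commit;
        # reload instead of restart.
        load_local_schemas()

    signal.signal(signal.SIGHUP, handle_sighup)
    load_local_schemas()  # initial load at startup

In a tornado-based service like eventlogging-service, the handler would typically just schedule the reload on the IOLoop (e.g. IOLoop.current().add_callback_from_signal(...)) rather than doing file I/O directly inside the signal handler.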
[19:05:56] ottomata: ok, got it
[20:10:10] nuria_: here's the puppet to go with that sighup thing
[20:10:12] https://gerrit.wikimedia.org/r/#/c/366623/
[20:10:19] tested in a hacky way in beta
[20:10:42] no need to review now
[20:10:43] just fyi
[20:15:23] ottomata: i cannot review the puppet side but i just CR-ed eventlogging, i am happy to merge and deploy today if you want
[20:17:09] nuria_: sure, that one can be deployed really safely i think
[20:17:19] the puppet thing i can test more easily once that change is deployed
[20:17:21] in beta too
[20:17:26] although, ungh, deploy doesn't work in beta
[20:17:29] have to pull from the target host :/
[20:18:57] ottomata: we can still test it with maximum craftiness
[20:19:25] i did test it with some pretty crafty craftiness
[20:19:27] :)
[20:22:24] ottomata: ok, should we merge the code on eventlogging-consumer and deploy then?
[20:24:35] (PS2) Nuria: [WIP] Modifying ingestion spec after additions to edit history [analytics/refinery] - https://gerrit.wikimedia.org/r/366327 (https://phabricator.wikimedia.org/T170493)
[20:25:15] yeah
[20:25:17] nuria_: thanks
[20:25:18] that would be great
[20:25:23] ottomata: k
[20:29:57] !Log deploying eventlogging c1c2c39411ccd002ff8cea197bc535155213f5fb and restarting
[20:30:14] !log deploying eventlogging c1c2c39411ccd002ff8cea197bc535155213f5fb and restarting
[20:30:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:31:49] !log restarting eventlogging on eventlog1001
[20:31:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:33:08] ottomata: deployed and restarted, tailing logs
[20:33:38] thank you
[20:36:21] nuria_: you are watching the /var/log/upstart/eventlogging_consumer-mysql-eventbus.log log?
[20:36:22] ya?
[20:36:42] ottomata: i was looking at the other one,
[20:36:48] watch this one :)
[20:36:50] i'm going to run
[20:37:35] sudo reload eventlogging/consumer NAME=mysql-eventbus CONFIG=/etc/eventlogging.d/consumers/mysql-eventbus
[20:38:02] ottomata: i restarted them all
[20:38:08] yaya
[20:38:09] ottomata: why would this one need a restart?
[20:38:14] it's not a restart
[20:38:18] i'm going to reload
[20:38:20] AKA sighup
[20:38:31] if you watch the logs, you'll see it reload the local schemas
[20:38:36] you watching?
[20:38:46] ottomata: ah sorry, one sec
[20:39:24] ottomata: that pagecontentsavecomplete is a pain
[20:39:50] * nuria_ watching
[20:40:10] works! :)
[20:40:11] ajajam
[20:41:20] ottomata: ok, looks good
[20:47:36] nice nuria_, puppet reload on schema change works too :)
[20:47:49] thank you!
[20:47:51] that was quick!
[20:47:52] :)
[20:49:54] milimetric, yt? leaving out the topNav links highlighting, were there other concerns about the vue-router that need to be solved for https://phabricator.wikimedia.org/T170459 ?
[21:00:41] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Audit users and account expiry dates for stat boxes - https://phabricator.wikimedia.org/T170878#3457923 (bearND) In the past life as an Android app dev I used EL quite regularly. Now in Reading Infrastructure I haven't had the need yet but it cou...
[21:05:11] hi mforns
[21:06:17] so in my opinion, we could remove vue-router as a dependency and use just a simple component that mirrors some property from the $store to the url and vice versa
[21:07:43] Analytics-Kanban, Analytics-Wikistats: Cleanup Routing code - https://phabricator.wikimedia.org/T170459#3457962 (Milimetric) In my opinion, we could remove vue-router as a dependency and use just a simple component that mirrors some property from the $store to the url and vice versa. That's what I had i...
[21:12:46] (PS29) Ottomata: Camus JSON datasets -> Hive [analytics/refinery/source] - https://gerrit.wikimedia.org/r/346291 (https://phabricator.wikimedia.org/T161924) (owner: Joal)
[21:14:40] milimetric: ^ i added a bit to write the last mod time of the input dir into the done flag file
[21:14:52] the last mod time of a dir changes if any of its content changes
[21:15:28] so, we can hopefully write something fancy to run jobs based on comparing the last mod time when the refinement happened vs the current last mod time of the input
[21:17:45] oh cool, that's easier and more direct
[21:36:15] milimetric, the reason is reducing the size of the bundle?
[21:37:38] mforns: the main thing to clean up is to centralize the logic that interacts with the URL
[21:37:49] so that components don't need to know how to router.push
[21:38:05] and we figured that if you're centralizing the state that drives that in the $store
[21:38:38] then there's only one place where the router would be used (to do router.push($store.getters.url) or whatever)
[21:38:59] so that means the router becomes kind of useless and it's fairly easy to replace it
[21:39:11] (PS30) Ottomata: Camus JSON datasets -> Hive [analytics/refinery/source] - https://gerrit.wikimedia.org/r/346291 (https://phabricator.wikimedia.org/T161924) (owner: Joal)
[21:39:20] and there are smaller libraries that help with interacting with the URL
[21:39:25] but we don't have to do all that at once
[21:39:50] we can keep the router and just have App.vue watch $store.getters.url and do router.push or something, that's totally fine
[21:39:59] or main.js or whatever
[21:54:23] milimetric, I understand, thanks!
[22:16:42] bye team, cya!
[22:16:54] ottomata: for the life of me i cannot find the application id of the indexing job that just failed
[22:16:56] ottomata:
[22:17:00] https://www.irccloud.com/pastebin/GdpjIh7c/
[22:45:56] Analytics-Tech-community-metrics: Empty or incorrect data on Kibana's "Git-Demographics" dashboard - https://phabricator.wikimedia.org/T171240#3458588 (Aklapper)
[22:53:23] ottomata: well, found the error in the indexing logs in druid but i just do not know what could be wrong with the new format
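A sketch of the last-mod-time idea from 21:14 above (a local-filesystem stand-in with assumed paths and flag format; the real refinement job would use the Hadoop filesystem API against HDFS): record the input directory's mtime in the done-flag when refining, and later re-run only if the directory has been modified since.

    import os

    def dir_mtime(path):
        # A directory's mtime updates when entries are added, removed or renamed.
        return int(os.path.getmtime(path))

    def needs_refinement(input_dir, flag_path):
        # True if input_dir changed since the recorded refinement time.
        if not os.path.exists(flag_path):
            return True
        with open(flag_path) as f:
            refined_mtime = int(f.read().strip())
        return dir_mtime(input_dir) > refined_mtime

    def write_done_flag(input_dir, flag_path):
        with open(flag_path, 'w') as f:
            f.write(str(dir_mtime(input_dir)))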