[06:28:29] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Refresh zookeeper nodes in eqiad - https://phabricator.wikimedia.org/T182924#4209400 (10elukey) Added conf100[4-6]'s ips to the analytics-in4 firewall rules, forgot to do it yesterday.
[07:27:53] 10Analytics, 10Analytics-Kanban, 10Services, 10User-Elukey: Upgrade Kafka Burrow to 1.1 - https://phabricator.wikimedia.org/T194808#4209437 (10elukey) p:05Triage>03High
[09:09:09] fdans: o/
[09:09:14] in the alerts I can see
[09:09:15] put: Permission denied: user=root, access=WRITE, inode="/wmf/data/archive/geoip":hdfs:hadoop:drwxr-xr-x
[09:09:34] the cron seems to run in the root's crontab, not in the hdfs one
[09:09:49] yes I was looking at it now :(
[09:13:38] elukey: so I should change the user in the cron to hdfs, or append sudo -u hdfs to the hdfs command?
[09:14:40] I think the second since the user hdfs won't have permission to copy the files from /usr/share/geoip
[09:15:59] it could be an option yes
[09:23:50] elukey: just tested, files copied correctly, patch's here https://gerrit.wikimedia.org/r/#/c/433335/
[09:26:36] fdans: done!
[09:27:02] thank youuuu elukey
[09:28:04] elukey: no action required since I already copied this week's files to hdfs, but I realize that I should have done it with someone else's supervision, since I was using sudo -u hdfs :(
[09:28:06] sorry for that
[09:28:59] that's fine :)
[09:35:00] hi guys, is it ok if I restart varnishkafka-webrequest on cp1008/pinkunicorn?
[09:35:50] vgutierrez: anytime, it doesn't push to our kafka topics, all good
[09:36:28] mmm now that I think about it it might, but it is not handling real traffic
[09:36:31] so super fine
[09:39:04] ack :)
[09:49:27] joal: druid upgraded to 0.11.0-2 in labs
[09:50:08] I verified and it seems that it deployes the druid-parquet extension correctly
[10:26:55] heya
[10:27:25] elukey: something weird is happening on wikimetrics-01.eqiad.wmflabs
[10:27:37] It happened a while ago, I rebooted the box
[10:27:41] and it's happening again now
[10:27:42] OperationalError: (2006, 'MySQL server has gone away')
[10:28:03] this used to happen all the time, and we put in this keepalive hack:
[10:28:23] https://github.com/wikimedia/analytics-wikimetrics/blob/master/wikimetrics/database.py#L271
[10:29:01] which now seems to be what's actually causing the error. If I reboot it, the problem goes away... so it's hard to debug
[10:29:30] do you know anything that might've changed? The SQL servers it uses are the labs ones.
[10:29:36] milimetric: good (very early) morning! :)
[10:29:53] :)
[10:30:37] no idea if anything changed in labs, do we know exactly what are the hosts ?
[10:31:13] maybe mysql shutdown the connection for some reason?
[10:35:26] if I remember correctly mysql was shutting down connections, which was breaking it all the time
[10:35:35] so that's why we put this keepalive thing
[10:35:56] checking on the hosts, one sec
[10:36:47] s1.analytics.db.svc.eqiad.wmflabs
[10:36:54] (03CR) 10Fdans: "Works great, just a comment on comments!" (032 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/433155 (https://phabricator.wikimedia.org/T194431) (owner: 10Milimetric)
[10:36:58] and tools.labsdb/s52261__wikimetrics
[10:39:00] full error report is:
[10:39:04] https://www.irccloud.com/pastebin/pMkrLU6E/
[10:40:14] so I guess basically the same thing is happening (mysql is closing the connection) but that ping_connection isn't working to revive it
[10:44:48] milimetric: but this goes away simply restarting uwsgi right? Not rebooting the box?
[10:44:54] yes
[10:45:01] ah ok
[10:46:17] ah s1.analytics.db.svc.eqiad.wmflabs. points to dbproxy1010, not sure if they have done maintenance
[10:46:42] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[10:46:42] last one was days ago
[10:46:46] this is me --^
[10:47:18] milimetric: so in the picture there is not only mysql, but a dbproxy in the middle
[10:47:27] basically haproxy that balances connections
[10:48:27] yeah, I remember them adding that
[10:48:45] ok, I'll try to figure out a way to reliably re-establish the connection
[10:50:14] I am wondering if https://github.com/wikimedia/analytics-wikimetrics/blob/master/wikimetrics/database.py#L271 is triggered, are there any logs about the retries?
[10:54:21] elukey: hm, oh, it might only be triggered on the mediawiki dbs and not the wikimetrics one, but check the log of the queue, it's there: sudo journalctl -f -u wikimetrics-queue
[10:56:33] ah yes, didn't know where to llok
[10:56:35] *look
[11:00:39] I've gotta go take care of the baby and start the day and stuff, no worries Luca I'll figure this out, I think it's more a matter of updating the code to get rid of the hack and re-establish the connection properly than a system thing. But there are some weird reports that the server is out of RAM or something
[11:01:11] I already checked the max_allowed_packet and wait_timeout, and both of those are set to reasonable settings, and we don't want to keep the connection around forever, we just want to reconnect whenever it dies
[11:02:27] ack, let me know if I can help!
[11:05:39] * elukey afk for a bit!
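For context on the reconnect discussion above: a minimal sketch, not the wikimetrics code, of the two standard SQLAlchemy ways to survive "MySQL server has gone away" without a hand-rolled keepalive listener like the database.py#L271 hack linked earlier. The connection URL and variable names are placeholders; pool_pre_ping needs SQLAlchemy 1.2+ and the example assumes pymysql is installed.

```python
# Sketch only: two SQLAlchemy options for connections dropped by the server/proxy.
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# Placeholder URL/credentials, not the real wikimetrics configuration.
DB_URL = "mysql+pymysql://user:password@s1.analytics.db.svc.eqiad.wmflabs/somedb"

# Option 1: ping each pooled connection on checkout and transparently
# reconnect if the proxy or server has silently dropped it.
engine = create_engine(DB_URL, pool_pre_ping=True)

# Option 2: disable pooling entirely, so every checkout opens a fresh connection
# (the approach the "Try NullPool for Wikimetrics db" patch later in the log tries).
engine_nopool = create_engine(DB_URL, poolclass=NullPool)
```

pool_pre_ping costs one lightweight round trip per checkout but keeps reusing pooled connections; NullPool avoids stale connections entirely by never reusing them.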
[11:12:55] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Index and store page preview agreggates on Druid so they are visible in pivot/superset - https://phabricator.wikimedia.org/T192305#4209902 (10Tbayer) Thanks! Already [[https://superset.wikimedia.org/superset/explore/druid/344/?form_data=%7B%22color_scheme%...
[13:23:14] (03PS4) 10Milimetric: Stop the bar chart from incrementing its height [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/433155 (https://phabricator.wikimedia.org/T194431)
[13:25:55] ottomata: o/
[13:26:01] I am building burrow 1.1
[13:26:21] yeeehaw
[13:26:34] https://github.com/linkedin/Burrow/pull/330 looks promising
[13:30:22] yesss
[13:34:15] elukey: ok if i do another slow rolling restart of jumbo brokers to apply https://gerrit.wikimedia.org/r/#/c/433214/ ?
[13:37:24] ottomata: sure sure, I decided to postpone the zookeeper swap on Monday so we'll not have too many changes in flight for the weekend (us gone, the hackathon, etc..)
[13:38:42] ok!
[13:43:43] one weird thing that I noticed this morning
[13:44:11] kafka acls --list tries to add znodes in zookeeper as opposed to simply read it
[13:44:50] I was checking some errors on conf1004 (the current leader, the only one accepting writes) but then I realized that those were the same even before the upgrade
[13:45:17] all coming from kafka-jumbo, I am pretty sure due to the exec that checks for the ANONYMOUS perms
[13:45:30] nothing problematic but I was a bit surprised about how kafka acls --list works
[13:45:45] (the errors were related to znodes already there0
[13:47:38] hm.
[13:47:42] elukey: not sure i understand
[13:48:45] you sure --list tries to add the znodes? maybe the --list exec is failing the check and the --add is being run after all?
[13:53:11] ottomata: I am sure, I tried to execute the --list while checking the logs
[13:53:22] (on zookeeper)
[13:59:53] (03PS1) 10Milimetric: Try NullPool for Wikimetrics db [analytics/wikimetrics] - 10https://gerrit.wikimedia.org/r/433379
[14:00:04] (03CR) 10Milimetric: [V: 032 C: 032] Try NullPool for Wikimetrics db [analytics/wikimetrics] - 10https://gerrit.wikimedia.org/r/433379 (owner: 10Milimetric)
[14:02:19] (03PS1) 10Milimetric: Fix bad NullPool use [analytics/wikimetrics] - 10https://gerrit.wikimedia.org/r/433381
[14:02:37] (03CR) 10Milimetric: [V: 032 C: 032] Fix bad NullPool use [analytics/wikimetrics] - 10https://gerrit.wikimedia.org/r/433381 (owner: 10Milimetric)
[14:02:52] and the logs say " cannot create znode because it already exists?"
[14:06:56] more or less yes
[14:11:42] I am testing burrow in labs now, seems working fine
[14:19:42] elukey: are you going to barcelona?
[14:20:10] nope
[14:20:14] fdans: ? mforns?
[14:20:17] I am going to Valencia :)
[14:20:20] leila is just asking :)
[14:20:28] wondering who from analytics is going
[14:20:31] ahh the hangtime! I forgot about it! :)
[14:20:54] lemme join
[14:58:51] ottomata: I was kicked out and now it's showing me that I'm alone in the room. :D
[14:59:08] anyway it's the end of the hour.
[14:59:10] oh yes!
[14:59:12] we all said goodbye!
[14:59:16] didn't realize you got kicked
[14:59:28] ottomata: elukey: thanks for the hangtime.
[14:59:32] :)
[14:59:33] :)
[14:59:39] ottomata: burrow 1.1 deb ready https://gerrit.wikimedia.org/r/#/c/433390/
[14:59:44] shall I deploy?
[15:04:31] 10Analytics, 10Analytics-EventLogging, 10Readers-Web-Backlog: Spike: Explore an API for logging events sampled by session - https://phabricator.wikimedia.org/T168380#4210314 (10Jhernandez)
[15:04:58] 10Analytics, 10Analytics-EventLogging, 10Readers-Web-Backlog: Spike: Explore an API for logging events sampled by session - https://phabricator.wikimedia.org/T168380#3362729 (10Jhernandez) p:05Low>03Normal
[15:06:16] ottomata: lzia naaa me staying in madrid, I was tempted though
[15:06:29] lzia: but i'll see you in ZA!
[15:07:42] 10Analytics, 10Analytics-EventLogging, 10Readers-Web-Backlog: Explore an API for logging events sampled by session - https://phabricator.wikimedia.org/T168380#4210320 (10Jhernandez)
[15:14:44] elukey: go for it for sure!
[15:28:01] ottomata: I'd need 10 mins to fix one thing, can we delay ops-sync?
[15:31:54] yes
[15:31:57] (finishing an email too)
[15:43:48] ottomata: i am in bc if you wnat
[15:43:56] OH coming
[15:49:26] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on einsteinium is CRITICAL: 1.798e+08 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[15:54:27] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 47 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[16:03:39] (03CR) 10Fdans: [C: 032] Stop the bar chart from incrementing its height [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/433155 (https://phabricator.wikimedia.org/T194431) (owner: 10Milimetric)
[16:10:12] (03PS1) 10Milimetric: Fix date parsing [analytics/refinery] - 10https://gerrit.wikimedia.org/r/433414
[16:11:51] (03CR) 10Ottomata: Fix date parsing (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/433414 (owner: 10Milimetric)
[16:13:50] (03CR) 10Milimetric: "apparently with wmf_raw.mediawiki_private_cu_changes it didn't work. But this patch appears to work, check it out:" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/433414 (owner: 10Milimetric)
[16:19:13] A-team - Anything you'd like me to mention on SoS?
[16:19:30] (03CR) 10Ottomata: [C: 031] "k!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/433414 (owner: 10Milimetric)
[16:19:57] oh joal I thought I was going, but I'm happy to finish this other work instead
[16:19:58] joal, on my side virtualpageview_hourly in hive and in druid/turnilo
[16:20:10] Noticed mforns :)
[16:20:13] Thanks !
[16:20:36] joal: ya maybe mention turnilo, and that we woudl like to (soon?) redirect pivot.wm.org -> turnilo
[16:20:37] joal: on my end geowiki -> geoeditors dataset: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Geoeditors
[16:21:32] (03CR) 10Milimetric: [V: 032 C: 032] Fix date parsing [analytics/refinery] - 10https://gerrit.wikimedia.org/r/433414 (owner: 10Milimetric)
[16:21:45] !log deploying refinery
[16:21:47] hm, joal could you also mention modern event platform stuff? just say that we're going to be doing some interviews with users of event stuff over the next month or so, and that anyone that is particularly interested should reach out to me (in case I don't know them yet)
[16:21:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:22:15] ottomata: Will even put that as a Callout :)
[16:22:51] Callout?
[16:23:17] in SoS there is this section where you pout stuff for people not to miss
[16:23:24] for linking: https://www.mediawiki.org/wiki/Wikimedia_Technology/Annual_Plans/FY2019/TEC2:_Modern_Event_Platform and https://phabricator.wikimedia.org/T185233
[16:23:29] ah ok cool
[16:26:14] ottomata: I am restarting burrow on kafkamon1001, some alerts might fire
[16:30:09] elukey: Should I mention Burrow 1.1 ?
(03PS1) 10Milimetric: Fix logging for a couple of scripts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/433419
[16:31:11] joal: sure :)
[16:34:06] (03CR) 10Milimetric: "Related puppet change is: https://gerrit.wikimedia.org/r/#/c/433419/" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/433419 (owner: 10Milimetric)
[16:37:31] PROBLEM - Kafka MirrorMaker main-eqiad_to_eqiad max lag in last 10 minutes on einsteinium is CRITICAL: 4.19e+06 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[16:37:33] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on einsteinium is CRITICAL: 3.129e+06 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[16:38:53] already recovered, due to the new burrow package
[16:38:57] (restarts needed)
[16:44:32] RECOVERY - Kafka MirrorMaker main-eqiad_to_eqiad max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[16:44:42] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[16:59:48] * elukey off!
[17:15:47] (03PS2) 10Milimetric: Fix logging for a couple of scripts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/433419
[17:15:54] (03CR) 10Milimetric: [V: 032 C: 032] Fix logging for a couple of scripts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/433419 (owner: 10Milimetric)
[17:19:26] (03PS1) 10Milimetric: Fix docopts format [analytics/refinery] - 10https://gerrit.wikimedia.org/r/433428
[17:19:42] (03CR) 10Milimetric: [V: 032 C: 032] Fix docopts format [analytics/refinery] - 10https://gerrit.wikimedia.org/r/433428 (owner: 10Milimetric)
[17:45:15] ottomata: can you merge https://gerrit.wikimedia.org/r/#/c/433420/ ? The deploy it needed is done now
[17:45:23] !log refinery deploy is done
[17:45:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:51:01] done, running puppet on an03
[17:52:14] milimetric: hm
[17:52:16] not sure what this is about
[17:52:16] Notice: /Stage[main]/Profile::Analytics::Refinery::Job::Sqoop_mediawiki/Cron[refinery-sqoop-mediawiki-private]/ensure: created
"-":78: command too long
[17:52:34] errors in crontab file, can't install.
[18:02:44] ottomata: we faced that with elukey the other day
[18:02:56] ottomata: cron commands can't be more than 998 chars
[18:06:05] ottomata: that's separate from my change, but looks like this commit broke that limit joal is talking about:
[18:06:06] https://github.com/wikimedia/puppet/commit/3124831dd0330e38605c78e79ef9ffeb1e31b5b5
[18:06:37] and yes... that is a VERY long command. I'll submit a patch with the shorthands for the flags, one sec
[18:06:47] indeed milimetric - elukey has mentioned it in scrum the other day - we should make decision on how to solve that
[18:07:00] joal: shorthand not good?
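For context on the "command too long" failure above: cron rejects crontab entries whose lines exceed roughly 1000 bytes (998 usable characters, per the conversation), which is why the generated sqoop invocation could not be installed. A minimal, hypothetical sketch of checking an installed crontab against that limit; the threshold and script are illustrative, not part of refinery or puppet.

```python
# Hypothetical check: flag crontab lines over the per-line limit discussed above.
import subprocess

MAX_CRON_LINE = 998  # figure quoted in the conversation; cron's buffer is ~1000 bytes

# Read the current user's crontab (requires Python 3.7+ for capture_output).
crontab = subprocess.run(
    ["crontab", "-l"], capture_output=True, text=True, check=True
).stdout

for lineno, line in enumerate(crontab.splitlines(), start=1):
    if len(line) > MAX_CRON_LINE:
        print(f"line {lineno}: {len(line)} chars (limit {MAX_CRON_LINE}): {line[:60]}...")
```

The fix that actually lands below is simpler: switch the sqoop command to short flags so the generated crontab entry stays under the limit.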
easy way:
[18:07:39] yeah use short flags
[18:07:42] oh sorry
[18:07:44] you already said!
[18:07:45] :p
[18:08:53] (doing now)
[18:17:47] ok, done https://gerrit.wikimedia.org/r/#/c/433433/
[18:17:50] (added yall)
[18:29:39] milimetric: We even hadn't thought about that with elukey :(
[18:29:45] milimetric: thanks for the good catch :)
[18:32:12] :) I like lazy fixes
[18:41:51] gonna run to a cafe back shortlyyyy
[19:28:23] milimetric: ah nice! Thanks!
[19:56:35] fdans: wow this sql_magic thing works and is awesome
[19:56:49] adding docs to wikitech
[19:57:00] I KNOW I'M ON FIRE ON JUPYTER RITE NOW
[19:57:41] this is so much better than stupiding with mini python scripts like I was doing before
[19:57:48] thank you ottomata
[20:00:25] pretty cool
[20:00:29] that was WAY easier than i thought it would be
[20:00:29] wow
[20:05:10] wow it even works via spark hive fdans
[20:16:08] fdans: https://wikitech.wikimedia.org/wiki/SWAP#sql_magic
[20:47:11] hehe, I'm so confused why someone would do that over just "hive"
[20:47:22] but open to it! Whatever works :)
[21:07:38] hehe
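As a pointer for the sql_magic workflow discussed above (documented at https://wikitech.wikimedia.org/wiki/SWAP#sql_magic): a minimal sketch of the kind of notebook usage being described, assuming a SWAP/Jupyter kernel with the sql_magic and pyhive packages available. The hostname, port, table and variable names below are illustrative, not the documented configuration; see the wikitech page for the actual setup.

```python
# --- notebook cell 1: open a Hive connection and register it with sql_magic ---
# (the %-magics only work inside Jupyter/IPython, not in a plain Python script)
from pyhive import hive

hive_conn = hive.connect("hive-server.example.net", port=10000)  # placeholder host

%load_ext sql_magic
%config SQL.conn_name = 'hive_conn'

# --- notebook cell 2: run a query; sql_magic stores the result in a DataFrame ---
%%read_sql df
SELECT uri_host, COUNT(*) AS requests
FROM wmf.webrequest
WHERE year = 2018 AND month = 5 AND day = 18
GROUP BY uri_host
LIMIT 10
```

The query result lands in a pandas DataFrame (df above) inside the notebook, which is the part that replaces the "mini python scripts" mentioned earlier in the log.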