[02:19:59] 10Analytics, 10MediaWiki-General-or-Unknown, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), 10Patch-For-Review, 10User-Tgr: Make aggregated MediaWiki Pingback data publicly available - https://phabricator.wikimedia.org/T152222#4034056 (10CCicalese_WMF) 05Open>03Resolved Aggregated MediaWiki pingba... [02:20:15] 10Analytics, 10MediaWiki-General-or-Unknown, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), 10User-Tgr: Make aggregated MediaWiki Pingback data publicly available - https://phabricator.wikimedia.org/T152222#4034058 (10CCicalese_WMF) [05:02:18] 10Analytics-Tech-community-metrics, 10Developer-Relations (Jan-Mar-2018): For new authors on C_Gerrit_Demo, provide a way to access the list of Gerrit patches of each new author - https://phabricator.wikimedia.org/T187895#4034191 (10srishakatux) [07:15:09] !log disable Camus on an1003 to allow the cluster to drain - prep step for an100[123] reboot [07:15:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:21:13] ok so my plan for this morning: [07:21:26] 1) drain a bit the cluster to ease the reboot of an100[123] [07:21:49] 2) check status of an1070 (logs, metrics, etc..) and add an1071 if everything is fine [07:22:11] 3) start the druid reboots [07:44:59] 10Analytics, 10MediaWiki-General-or-Unknown, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), 10User-Tgr: Make aggregated MediaWiki Pingback data publicly available - https://phabricator.wikimedia.org/T152222#4034303 (10Kghbln) @CCicalese_WMF A great effort. Thanks a lot for tackling this! [08:18:25] ok there are some spark jobs and a hive one currently [08:18:33] I'll start with an1001 failover to 1002 [08:24:49] ok 1001 is rebooting [08:30:54] now 1002's turn [08:33:19] bearloga: o/ - online by any chance? [08:38:19] Hi elukey [08:38:26] what's up with hadoop? [08:40:07] joal: what do you mean? 
[08:40:29] I've seen your messages earlier elukey, and wondered if there was anything I could help with [08:40:36] ahhhh! [08:40:42] I thought that something was exploding [08:40:50] no no no [08:40:57] 1001/2 failedover/rebooted/etc.. [08:40:58] You'd have heard me shouting louder if so :) [08:41:05] awesome [08:41:32] From being able to look t yarn UI, I see you've made an1001 back up as master :) [08:41:45] yep! [08:41:53] Many thanks for that :) [08:41:59] I was about to do 1003 but there is a hive query running :( [08:42:27] Hm [08:43:01] well there are several in a row, I think it is either report updater or a script [08:43:12] I think it's reportupdater [08:43:19] You should move ahead elukey [08:45:04] brutally stopping hive? [08:45:19] just stopping hive [08:45:20] wait no more queries [08:45:23] go gogogoggo [08:45:27] :) [08:45:45] I mean, brutally is usually not a good solution - Except when ottomata uses his hammer [08:47:12] will leave the appmaster to complete and then I'll stop [08:47:29] druid will also doesn't like this due to the db [08:47:43] but there shouldn't be any indexation ongoing so not a big deal [08:47:50] elukey: correct [08:49:54] aaand rebooting [08:50:44] otto - camus-webrequest_canary_test ? [08:51:04] ah no appeared only for a brief moment [08:51:07] maybe a old cron? 
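The reboot procedure being worked through above (drain the cluster, wait for running jobs, fail over the master, reboot) hinges on one repeated check: are there still YARN applications running? A minimal sketch of that check; the parsing targets the tabular output of the real `yarn application -list -appStates RUNNING` CLI, but the sample text and helper names below are illustrative, not captured from this cluster:

```python
# Sketch: decide whether it is safe to reboot a Hadoop master by counting
# RUNNING YARN applications. The parsing targets the tabular output of
# `yarn application -list -appStates RUNNING`; SAMPLE_OUTPUT below is
# illustrative, not taken from the cluster in this log.

SAMPLE_OUTPUT = """\
Total number of applications (application-types: [] and states: [RUNNING]):2
                Application-Id      Application-Name    Application-Type
application_1520000000000_0001      some-spark-job                 SPARK
application_1520000000000_0002      hive-query                 MAPREDUCE
"""

def count_running_apps(yarn_list_output: str) -> int:
    """Count rows whose first field looks like a YARN application id."""
    return sum(
        1
        for line in yarn_list_output.splitlines()
        if line.strip().startswith("application_")
    )

def safe_to_reboot(yarn_list_output: str) -> bool:
    """True only when no application rows are present."""
    return count_running_apps(yarn_list_output) == 0
```

Live, the output would come from something like `subprocess.run(["yarn", "application", "-list", "-appStates", "RUNNING"], capture_output=True, text=True)` on a client node.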
[08:54:17] anyhow, an1003 up and running [08:54:31] !log re-enable camus after reboots [08:54:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:56:00] Hadoop reboots completed :) [08:56:10] Many thanks elukey :) [08:57:55] 10Analytics-Kanban, 10User-Elukey: Reboot all Analytics hosts for Kernel upgrade - https://phabricator.wikimedia.org/T188594#4034363 (10elukey) [09:37:11] so I checked the logs for an1070 [09:37:23] HDFS datanode looks good, nothing weird in there afaics [09:37:37] on the nodemanager the only weird thing is a recurrent [09:37:38] WARN org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: Unexpected: procfs stat file is not in the expected format for process with pid 43846 [09:37:53] that seems to be https://issues.apache.org/jira/browse/YARN-3344?devStatusDetailDialog=repository [09:39:32] elukey: is that --^ a show-stopper? [09:40:04] joal: I am trying to figure it out, it seems more an annoyance than a real issue [09:40:26] From the issdue description, seems to have no functional issue except log spamming [09:45:22] https://github.com/apache/hadoop/blob/branch-2.6.0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java#L514 [09:49:23] so the comment before says // Set (name) (ppid) (pgrpId) (session) (utime) (stime) (vsize) (rss) [09:50:58] BUT [09:50:59] https://gerrit.wikimedia.org/r/#/c/395923/2/templates/hadoop/yarn-site.xml.erb [09:52:01] so reading the code it seems that we are good [09:52:28] https://github.com/apache/hadoop/blob/branch-2.6.0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java#L265 [09:52:42] joal: what do you think? [09:52:52] elukey: I actually don't know ! 
[09:57:43] so the current spam is about /proc/pid/stat, and the only valuable thing in there in my opinion is the rss [09:58:01] but, due to Erik's change, we use smaps [10:34:48] 10Analytics: Intervals for data arround pageviews in wikistats maps - https://phabricator.wikimedia.org/T188928#4024359 (10mforns) @Nuria I'm considering this task for the GSoC, but I don't completely understand the title. Is is about adding a time interval selector (like a slider) for the map chart? [10:43:55] (03PS6) 10Joal: [WIP] Add by-wiki stats to MediawikiHistory job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/415255 (https://phabricator.wikimedia.org/T155507) [10:45:45] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add by-wiki stats to MediawikiHistory job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/415255 (https://phabricator.wikimedia.org/T155507) (owner: 10Joal) [10:47:14] 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10Patch-For-Review, 10User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4034668 (10elukey) I checked the HDFS datanode logs and everything looks good on analytics1070. The only "weird" log is for Yarn, name... [10:50:23] elukey: does that --^ mean we're gonna add an1071? [10:50:27] :D [10:54:41] 10Analytics-Kanban, 10Analytics-Wikistats: Change '--' to something more helpful in Wikistats page views by coutry table view - https://phabricator.wikimedia.org/T187427#4034681 (10mforns) [10:56:28] joal: I am more on the +1 side yes, but probably better to way to andrew since we are not in a hurry [10:57:01] 10Analytics-Kanban, 10Analytics-Wikistats: Change '--' to something more helpful in Wikistats page views by coutry table view - https://phabricator.wikimedia.org/T187427#3974857 (10mforns) I changed the title of this task and removed part of its description. The zoom limits are already in place. 
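The WARN being chased above is easier to see with the file format in hand: the second field of `/proc/<pid>/stat` is the command name in parentheses, and YARN-3344 is precisely about process names that break a naive regex over that line. A sketch of robust parsing; the sample line is invented for illustration, and the field offsets follow proc(5):

```python
# Why ProcfsBasedProcessTree's regex can trip: in /proc/<pid>/stat the comm
# field sits in parentheses and may itself contain spaces or ')', e.g.
# "(tmux: server)". Splitting the whole line on whitespace then shifts every
# later field. Anchoring on the LAST ')' avoids that. SAMPLE_STAT is an
# illustrative line, not one captured from the cluster.

SAMPLE_STAT = (
    "4321 (tmux: server) S 1 4321 4321 0 -1 4194304 "
    "100 0 0 0 5 3 0 0 20 0 1 0 12345 1048576 250"
)

def parse_proc_stat(stat_line: str) -> dict:
    lpar = stat_line.index("(")
    rpar = stat_line.rindex(")")          # last ')' ends the comm field
    pid = int(stat_line[:lpar].strip())
    comm = stat_line[lpar + 1:rpar]
    rest = stat_line[rpar + 1:].split()   # state, ppid, pgrp, session, ...
    return {
        "pid": pid,
        "comm": comm,
        "ppid": int(rest[1]),
        "session": int(rest[3]),
        "utime": int(rest[11]),
        "stime": int(rest[12]),
        "vsize": int(rest[20]),
        "rss_pages": int(rest[21]),       # rss is the 24th field of the line
    }
```

Of the fields the Hadoop comment lists, rss is the interesting one for memory accounting, which is consistent with the reading above that, with smaps enabled, the warning is log spam rather than a functional problem.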
[10:57:51] +1 elukey :) [10:58:01] elukey: There actually is something to update I think [10:58:18] elukey: an1070 got added in default-rack for HDFS [10:58:19] 10Analytics, 10Analytics-Wikistats: Limit pan in Wikistats2 maps - https://phabricator.wikimedia.org/T189195#4034694 (10mforns) [10:59:56] joal: nope [10:59:57] Rack: /eqiad/A/4 10.64.5.24:50010 (analytics1070.eqiad.wmnet) [11:00:11] we have an alarm for it now [11:00:20] it would have sent a icinga msg in here otherwise [11:00:40] elukey: Something has changed since me checking yesterday :) [11:00:42] where did you see the default rack? [11:00:51] Sounds great [11:01:03] yes Andrew updated the Python script with racks etc.. [11:01:34] You guys rock :) [11:08:27] I had a look at the icinga alerts for notebook1003/1004, that seems like an incomplete deployment, the systemd unit is trying start a script under /srv/jupyterhub which doesn't exist [11:09:25] moritzm: very probable - I hink Andrew started to work on this yesterday - He probably didn't finish [11:09:58] moritzm: yep they are still wip afaik [11:13:14] ok, I mostly had a look since I was worried that I broke them during my reboots yesterday :-9 [11:16:38] 10Analytics, 10Analytics-Wikistats: Use line charts when breaking down a column chart in Wikistats2 - https://phabricator.wikimedia.org/T189200#4034779 (10mforns) [11:20:19] joal: an1071 deployed :) [11:20:24] Yay ! [11:20:53] elukey: already loaded :) [11:25:19] checked df -h across all the hadoop workers, now we are at ~90% but not worse [11:25:22] so it is an improvement :) [11:25:36] (yesterday an1036 was saturated) [11:25:47] I think it'll take some time to distribute over new nodes [11:26:38] elukey: question for you - How can I find a list of prometheus metrics / format available for grafan? 
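Besides Grafana auto-completion, Prometheus itself can answer joal's question: the standard HTTP API lists every metric name under `/api/v1/label/__name__/values`. The endpoint path is real Prometheus API; the host below is a placeholder and the response is a canned example rather than live cluster data, so the sketch runs without network access:

```python
import json

# Listing Prometheus metric names via the label-values API endpoint.
# PROMETHEUS_URL is a placeholder host; CANNED_RESPONSE stands in for the
# JSON a real server would return.

PROMETHEUS_URL = "http://prometheus.example.org"  # placeholder host

def metric_names(api_response_json: str) -> list:
    """Extract and sort metric names from a label-values API response."""
    payload = json.loads(api_response_json)
    if payload.get("status") != "success":
        raise ValueError("Prometheus API reported failure")
    return sorted(payload["data"])

CANNED_RESPONSE = json.dumps({
    "status": "success",
    "data": ["node_memory_MemFree_bytes", "up", "node_cpu_seconds_total"],
})
```

Live, the JSON would come from an HTTP GET of `PROMETHEUS_URL + "/api/v1/label/__name__/values"`.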
[11:29:27] 10Analytics, 10Analytics-Wikistats: Beta: How to link to/cite Wikistats 2.0 - https://phabricator.wikimedia.org/T179975#4034810 (10mforns) [11:29:31] 10Analytics, 10Analytics-Wikistats: Wikistats Bug – Put view settings in URL so it can be shared - https://phabricator.wikimedia.org/T179444#4034813 (10mforns) [11:29:41] 10Analytics, 10Analytics-Wikistats: Breakdowns should be sticky on url such you can bookmark them - https://phabricator.wikimedia.org/T184136#4034815 (10mforns) [11:29:43] 10Analytics, 10Analytics-Wikistats: Wikistats Bug – Put view settings in URL so it can be shared - https://phabricator.wikimedia.org/T179444#3724954 (10mforns) [11:30:22] 10Analytics, 10Analytics-Wikistats: Wikistats Bug – Put view settings in URL so it can be shared - https://phabricator.wikimedia.org/T179444#3724954 (10mforns) From duplicate task: > I'd like to recommend that we make the deeplink to each report discoverable and shareable (if it's persistent). For example, at... [11:32:47] joal: there is auto-completion for metrics names when you add them in grafana, but I am afraid that there is not complete list except from that [11:32:51] do you have anything in mind? [11:33:12] elukey: was willing to see cluser usage (RAM + Cores) - I'll use auto-completion :) [11:33:16] Thanks :) [12:00:11] joal: I'd be ready to flip eventlogging to 1002 [12:01:00] elukey: Do you wish me to check something specifically ? [12:01:56] joal: if you have a bit of time I'd really love an extra pair of eyes :) [12:02:18] elukey: I always make time for these activity :) [12:02:23] \o/ [12:02:32] elukey: Checking on grafana? [12:02:37] so the procedure that I'd like to follow is: [12:02:59] 1) stop all the daemons on 1001 except the zmq forwarder (still waiting for performance to migrate away coal..) [12:03:12] this should allow all consumer group to commit their offset etc.. 
[12:03:35] 2) apply the eventlogging analytics role to eventlog1002 and run puppet on it [12:03:55] now that I think about it, do I need to deploy EL to 1002 first? [12:04:55] elukey: I can't say :( [12:06:11] * elukey goes on tin [12:11:13] ah yes there is a separate repo for scap config [12:14:58] joal: https://gerrit.wikimedia.org/r/#/c/417244/ [12:16:02] elukey: shouldn't we always deploy to both el1001 and el1002? [12:16:51] joal: well eventlog1001 will be deprecated as soon as eventlog1002 is up (and coal is migrated, but there is no plan to work on zmq) [12:17:55] elukey: ok then [12:22:37] all right in theory puppet should download everything when it runs the first time [12:23:08] elukey: Just added a row to Hadoop Prometheus dashboard [12:23:13] At the bottom [12:23:14] goood [12:23:27] Interesting finding on compute usage [12:23:36] 10Analytics, 10Analytics-Wikistats: Address design feedback from Volker - https://phabricator.wikimedia.org/T167673#4034929 (10mforns) [12:23:39] 10Analytics, 10Analytics-Wikistats, 10Accessibility, 10Easy: Wikistats Beta: Fix accessibility/markup issues of Wikistats 2.0 - https://phabricator.wikimedia.org/T185533#4034931 (10mforns) [12:24:06] 10Analytics, 10Analytics-Wikistats, 10Accessibility, 10Easy: Wikistats Beta: Fix accessibility/markup issues of Wikistats 2.0 - https://phabricator.wikimedia.org/T185533#3918835 (10mforns) From duplicate task: > https://www.mediawiki.org/wiki/Topic:Tp059xh118wkx24b > > Also, use this to clean up other th... [12:24:29] arghhhh [12:24:30] /dev/md0 9.1G 5.7G 3.0G 66% / [12:24:39] this is on 1002 [12:24:47] definitely a no-go [12:24:53] :( [12:24:56] I'll probably need to reimage [12:25:00] mwarf [12:25:50] joal: I get some errors while opening the dashboard [12:26:11] "invalid dimensions to plot etc.." 
[12:26:22] :( [12:26:38] you don't get them I suppose [12:26:44] * elukey cries [12:29:53] elukey: when in non-admin mode - Yes I do have them [12:30:18] lol [12:30:20] this is new [12:30:43] I also have them in admin though [12:30:50] elukey: I think I understand [12:30:56] ah no wait, if I edit I can see [12:31:48] elukey: Solved ! [12:32:01] \o/ [12:32:47] Compute usage is awesome [12:33:10] elukey: tells us that MW-history is RAM consuming !!! [12:33:27] And that our default MEM/CORES ration is good :) [12:38:53] 10Analytics, 10DBA, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4034945 (10jcrespo) [12:39:11] 10Analytics, 10DBA, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4034956 (10jcrespo) p:05Triage>03High [12:40:23] 10Analytics, 10Analytics-Wikistats: Make the colors used the line charts in Wikistats 2 more easy to recognize. - https://phabricator.wikimedia.org/T183184#3846128 (10mforns) A related thing: Maybe we could have single-line charts could have the area below the line colored? It's totally a subjective personal p... [12:50:06] 10Analytics, 10Operations: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3922494 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['eventlog1002.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage... [12:54:50] 10Analytics, 10Analytics-Wikistats: Beta: Y-axis units and rounding issues - https://phabricator.wikimedia.org/T187429#3974963 (10mforns) Regarding nr. 3 I totally agree with @Volans that the popup box is unnecessarily colliding with the line and makes it unconfortable to read its values. However, I am strong... 
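The cramped root partition spotted just before (`/dev/md0`, 9.1G at 66% used) is the kind of no-go a scripted pre-deploy check can flag before any code ships. A minimal standard-library sketch, the programmatic equivalent of eyeballing `df -h`; the threshold value is illustrative, not a project standard:

```python
import shutil

# Pre-deploy headroom check. The 5 GiB default threshold is illustrative.

def has_headroom(path: str = "/", min_free_gib: float = 5.0) -> bool:
    """True if the filesystem holding `path` has at least min_free_gib free."""
    free_gib = shutil.disk_usage(path).free / (1024 ** 3)
    return free_gib >= min_free_gib
```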
[13:05:04] 10Analytics, 10Analytics-Wikistats: Beta: Pageviews by Country Monthly should specify month in question - https://phabricator.wikimedia.org/T187389#4034982 (10mforns) [13:05:06] 10Analytics-Kanban, 10Analytics-Wikistats, 10Easy: [Wikistats2] The detail page for tops metrics does not indicate time range - https://phabricator.wikimedia.org/T182990#4034985 (10mforns) [13:05:35] 10Analytics-Kanban, 10Analytics-Wikistats, 10Easy: [Wikistats2] The detail page for tops metrics does not indicate time range - https://phabricator.wikimedia.org/T182990#3840528 (10mforns) From duplicate task: > "Pageviews by Country Monthly" should specify month in question > > Right now on the map for pa... [13:06:11] 10Analytics-Kanban, 10Analytics-Wikistats, 10Easy: [Wikistats2] The detail page for tops and maps metrics does not indicate time range - https://phabricator.wikimedia.org/T182990#4034987 (10mforns) [13:20:16] 10Analytics, 10Operations: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#4035006 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['eventlog1002.eqiad.wmnet'] ``` and were **ALL** successful. [13:23:30] joal: ready to switch [13:23:43] Ok !! [13:27:19] ok eventlog1001 stopped [13:29:21] and now going to apply the role to 1002 [13:33:12] I think it worked [13:39:50] * elukey dances [13:40:00] \o/ [13:40:15] elukey: charts look good [13:40:23] * joal dances with elukey now :) [13:40:31] woooooooooooooooooooooooowwwwwwwwwwwwwwwwwwwwwwww [13:43:50] joal: would you like to run sudo eventloggingctl status on eventlog1002 ? [13:44:00] just to see if it works properly [13:45:55] * elukey hugs joal [13:46:11] elukey: I'm assuming the last line about eventlogging.service being exited means normal, right? 
[13:46:29] All other lines look good (running everywhere) [13:47:06] ah this is the new thing :) [13:47:19] so eventlogging.service is a "dummy" systemd unit [13:47:41] 12 processors, 1 MySQL consumer m4, 1 MySQL eventbus, 1 client-side log consumer, 1 all-events log consumer [13:47:46] Ahhhh [13:47:48] so when you do systemctl stop eventlogging all the daemons will stop [13:47:59] Therefore, it says it itself is stopped [13:48:13] bu other daemons are running [13:48:38] nono it is up [13:49:14] ah it says "exited" in the status [13:49:23] right [13:49:38] but if you do systemctl status eventlogging it shows up as active [13:49:51] need to adjust what to visualize in eventloggingctl [13:50:16] since it is a dummy service it can't be running [13:50:38] joal: try to do sudo systemctl cat eventlogging [13:50:56] basically all it does is "ExecStart=/bin/true" [13:51:13] :) [13:56:20] 10Analytics, 10Operations: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#4035038 (10elukey) [13:56:29] 10Analytics, 10Operations, 10hardware-requests: EQIAD: (1) hardware request for eventlog1001 replacement - eventlog1002. 
- https://phabricator.wikimedia.org/T184551#4035040 (10elukey) [13:56:32] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Upgrade eventlogging servers to Stretch - https://phabricator.wikimedia.org/T114199#4035041 (10elukey) [13:56:36] 10Analytics, 10Operations: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3922494 (10elukey) 05Open>03Resolved [14:01:10] sent an email to analytics@ [14:16:37] a-team: everything looks good, eventlog1002 is the new EL host :) [14:16:41] \o/ [14:24:44] I also added the committed offset to https://grafana.wikimedia.org/dashboard/db/kafka-consumer-lag [14:24:58] the eventlogging ones look steady [14:32:29] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Upgrade eventlogging servers to Stretch - https://phabricator.wikimedia.org/T114199#4035142 (10elukey) So current situation: 1) on eventlog1002 all daemons but zmq-forwarder are running fine (stretch/systemd) 2... [14:32:38] ottomata: morningggg [14:35:23] elukey et al, kudooooossss! [14:35:28] \o/ [14:37:20] all right going to take my lunch break now :) [14:37:30] + a little errand [14:37:57] 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10Patch-For-Review, 10User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4035154 (10Ottomata) +1 elukey! But, isn't https://gerrit.wikimedia.org/r/#/c/395923/ already merged? [14:38:00] ping me if needed! [14:38:02] elukey: hiii [14:38:15] if you like when you get back let's do eventlog1002? [14:38:23] oh i have 1:1 with nuria at 11 [14:38:30] ottomata: it is already done! [14:39:01] on eventlog1001 it is now running only the zmq-forwarder [14:39:16] OH! 
[14:39:18] OOOOOO [14:39:18] COOL [14:39:59] 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10Patch-For-Review, 10User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4035162 (10elukey) >>! In T188294#4035154, @Ottomata wrote: > +1 elukey! But, isn't https://gerrit.wikimedia.org/r/#/c/395923/ alread... [14:40:36] 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10Patch-For-Review, 10User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4035163 (10Ottomata) AH hm ok! [14:41:20] ottomata: also an1071 is serving traffic [14:41:31] going afk for a bit (lunch + errand) [14:43:07] yeehaw [14:43:21] joal: heyyyaaa [14:48:05] (03CR) 10Ottomata: Add EL and whitelist sanitization (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: 10Mforns) [14:48:46] mforns: heyy, yt? [15:05:48] 10Analytics, 10Google-Summer-of-Code (2018): Analytics GSoC - https://phabricator.wikimedia.org/T189210#4035243 (10mforns) [15:13:48] 10Analytics, 10Analytics-Wikistats: Use line charts when breaking down a column chart in Wikistats2 - https://phabricator.wikimedia.org/T189200#4035268 (10mforns) [15:13:50] 10Analytics, 10Analytics-Wikistats: Limit pan in Wikistats2 maps - https://phabricator.wikimedia.org/T189195#4035270 (10mforns) [15:13:52] 10Analytics-Kanban, 10Analytics-Wikistats: The alert message about adblocker is not fully shown on smaller screens - https://phabricator.wikimedia.org/T188208#4035271 (10mforns) [15:13:54] 10Analytics, 10Analytics-Wikistats: Beta: Y-axis units and rounding issues - https://phabricator.wikimedia.org/T187429#4035272 (10mforns) [15:13:56] 10Analytics-Kanban, 10Analytics-Wikistats: Change '--' to something more helpful in Wikistats page views by coutry table view - https://phabricator.wikimedia.org/T187427#4035273 (10mforns) [15:13:58] 10Analytics-Kanban, 10Analytics-Wikistats: Wikistat 
Beta: expand topic explorer by default - https://phabricator.wikimedia.org/T186335#4035274 (10mforns) [15:14:00] 10Analytics, 10Analytics-Wikistats: Display of radio buttons in Wikistats 2 is somewhat confusing - https://phabricator.wikimedia.org/T183185#4035276 (10mforns) [15:14:03] 10Analytics, 10Analytics-Wikistats: Make the colors used the line charts in Wikistats 2 more easy to recognize. - https://phabricator.wikimedia.org/T183184#4035277 (10mforns) [15:14:05] 10Analytics, 10Analytics-Wikistats, 10Accessibility, 10Easy: Wikistats Beta: Fix accessibility/markup issues of Wikistats 2.0 - https://phabricator.wikimedia.org/T185533#4035275 (10mforns) [15:14:09] 10Analytics, 10Analytics-Wikistats: Present Wikistats 2 charts for the period selected by the user. - https://phabricator.wikimedia.org/T183183#4035278 (10mforns) [15:14:14] 10Analytics, 10Analytics-Wikistats: Beta Release: Remaining UI advice from Erik - https://phabricator.wikimedia.org/T182109#4035281 (10mforns) [15:14:16] 10Analytics-Kanban, 10Analytics-Wikistats, 10Easy: [Wikistats2] The detail page for tops and maps metrics does not indicate time range - https://phabricator.wikimedia.org/T182990#4035280 (10mforns) [15:14:18] 10Analytics, 10Analytics-Wikistats: Consistently preserve settings when a user switches to a new metric (especially on the same page). 
- https://phabricator.wikimedia.org/T183181#4035279 (10mforns) [15:14:20] 10Analytics, 10Analytics-Wikistats: Make Wikistats data easily embedable on-wiki - https://phabricator.wikimedia.org/T178016#4035284 (10mforns) [15:14:22] 10Analytics, 10Analytics-Wikistats: Consider adding breadcrumbs to Wikistats 2 - https://phabricator.wikimedia.org/T178018#4035283 (10mforns) [15:14:24] 10Analytics, 10Analytics-Wikistats: Wikistats Bug – Put view settings in URL so it can be shared - https://phabricator.wikimedia.org/T179444#4035282 (10mforns) [15:14:26] 10Analytics, 10Google-Summer-of-Code (2018): Analytics GSoC - https://phabricator.wikimedia.org/T189210#4035267 (10mforns) [15:15:24] (03CR) 10Mforns: Add EL and whitelist sanitization (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: 10Mforns) [15:24:55] sorry for the pollution folks :[ [15:30:08] 10Analytics, 10Analytics-Wikistats: Wikistats2 line chart and map displacement bugs in Chrome+Ubuntu - https://phabricator.wikimedia.org/T189197#4035329 (10Aklapper) [15:30:27] 10Analytics, 10DBA, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4035332 (10mobrovac) After [lowering the concurrency](https://gerrit.wikimedia.org/r/#/c/417270/) the number of new connections slightly... [15:37:45] mforns: what you think about getting that sanitization thing merged soon? [15:37:58] i'd like to deploy refinery and get these refine job changes in [15:38:00] maybe on monday [15:38:02] ottomata, I'm working on it, refactoring the unit tests right now [15:38:06] ok awesome [15:38:14] k [15:38:29] also will address your comment [15:38:38] cool [15:38:50] jon robson was pinging me about when we could get the geocode etc. 
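The "dummy" `eventlogging.service` elukey described earlier (all it does is `ExecStart=/bin/true`, status shows "exited" yet the unit is active, and stopping it stops all the daemons) maps onto a systemd layout roughly like the following. This is a sketch, not the actual puppet-managed unit files:

```ini
# Aggregate "dummy" unit: /bin/true runs and exits immediately, and
# RemainAfterExit keeps the unit "active (exited)", matching the status
# seen above. Names here are illustrative.
[Unit]
Description=EventLogging (aggregate unit)

[Service]
Type=oneshot
ExecStart=/bin/true
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

Each real daemon unit (the processors and consumers) would then declare `PartOf=eventlogging.service`, so that `systemctl stop eventlogging` propagates the stop to all of them.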
stuff deployed [15:39:02] woudl be nice to deploy refinery with your santize thing too [15:43:18] 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: Port Kafka clients to new jumbo cluster - https://phabricator.wikimedia.org/T175461#4035357 (10Ottomata) [15:54:04] 10Analytics: Intervals for data arround pageviews in wikistats maps - https://phabricator.wikimedia.org/T188928#4035381 (10Nuria) [15:54:17] 10Analytics: Intervals for data arround pageviews in wikistats maps - https://phabricator.wikimedia.org/T188928#4024359 (10Nuria) @mforns reworded title [15:55:30] thx! [16:03:27] 10Analytics-Kanban, 10Operations, 10ops-eqiad: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#4035402 (10Cmjohnson) @elukey Did a check again today, the errors have not come back. Do you want to put back in production? [16:10:52] elukey: joal was right, it was reportupdater query [16:18:38] bearloga: :) [16:18:47] 10Analytics, 10Analytics-Cluster, 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Move EventLogging analytics processes to Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T183297#4035491 (10Ottomata) [16:18:48] hope that I didn't break anything of yours [16:19:11] 10Analytics, 10Analytics-Cluster, 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Move EventLogging analytics processes to Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T183297#3849554 (10Ottomata) [16:20:45] ottomata: I've added the commit offset to the lag dashboard (https://grafana.wikimedia.org/dashboard/db/kafka-consumer-lag?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=eqiad&var-topic=All&var-consumer_group=All) [16:21:01] so we can track new/old cgroups [16:22:13] elukey: hm cool [16:22:24] why is the committed offset useful if you have the lag/ [16:24:23] ottomata: this morning I was chatting with d*causse and he said that mjolnir was not yet consuming but already migrated, lag 
was 0 but it was not clear if the client was consuming or not [16:24:27] so I thought to add it [16:25:39] mforns: Super thanks for cleaning up all wikistats tags [16:25:45] *tasks [16:26:08] ah ok interesting [16:26:13] nuria_, np, maybe in grosking today we can decide if some of them are too urgent to wait until GSoC [16:29:20] 10Analytics, 10Analytics-Cluster, 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Move EventLogging analytics processes to Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T183297#4035522 (10Ottomata) [16:39:57] 10Analytics-Tech-community-metrics, 10Developer-Relations: Review entries in https://github.com/Bitergia/mediawiki-repositories/ to exclude/include - https://phabricator.wikimedia.org/T187711#4035576 (10Aklapper) Well, https://github.com/Bitergia/mediawiki-repositories/ is not used anymore in our stack so the... [16:40:00] AHHH CRAP elukey [16:40:09] there is a bug that got lost in webrequest cache text refactoring [16:40:41] https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/cache/kafka/webrequest.pp#L118-L123 [16:40:53] so, all vks currently set protocol version to 0.9.0.1 [16:40:55] yarrrggh [16:41:01] i mean, its ok, but we should change it [16:41:09] so, not done! going to make a patch and test on canary [17:00:20] (still in cassandra standup, I'll be a bit late for standup sorry! [17:01:06] k! [17:07:52] ottomata: ah snap you are right! [17:13:51] ottomata: Super weird error in yarn: [17:13:52] java.lang.IllegalArgumentException: Required executor memory is above the max threshold (8192 MB) of this cluster! [17:14:08] ottomata: have we changed anything? [17:14:24] maybe some conf in new nodes is not having correct values for some params? [17:14:30] there was a patch yesterday that should have increased max memory for a the new node nodemanagers... [17:15:31] https://gerrit.wikimedia.org/r/#/c/416965/ [17:16:21] joal: do you know what node that is on? 
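elukey's rationale for the extra panel (mjolnir reported lag 0 while it was unclear whether the client was consuming at all) comes down to asking Prometheus two different questions. Hypothetical PromQL along these lines; the metric names are Burrow-style placeholders, not confirmed from the actual dashboard:

```
# Hypothetical metric names; adjust to whatever the exporter actually emits.

# Lag per consumer group: reads 0 both when the group is caught up AND when
# it has simply never started consuming.
sum by (group) (kafka_burrow_partition_lag)

# Committed-offset progress: a flat 0 rate here means the group is not
# committing anything, even while its reported lag is 0.
sum by (group) (rate(kafka_burrow_partition_current_offset[5m]))
```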
[17:16:37] PROBLEM - Check status of defined EventLogging jobs on eventlog1002 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-m4-master-00 consumer/mysql-eventbus consumer/client-side-events-log consumer/all-events-log processor/client-side-11 processor/client-side-10 processor/client-side-09 processor/client-side-08 processor/client-side-07 processor/client-side-06 processor/client-side-05 processor/client-side-04 proces [17:16:37] processor/client-side-02 processor/client-side-01 processor/client-side-00 [17:16:52] uh oh elukey^ [17:17:04] lovely [17:17:13] hm, eventlogging ctl says running [17:17:35] ottomata: I ran from stat1004 [17:17:46] oh it didn't get launched joal? [17:17:49] nope [17:18:36] so that thing does /usr/lib/nagios/plugins/check_eventlogging_jobs [17:19:14] that I am seeing it now for the first time :D [17:19:25] /sbin/status -q "eventlogging/${role}" NAME="${name}" [17:19:40] oh yeah, looks pretty upstarty to me! [17:19:43] yeah! [17:19:49] but it looks working fine right? [17:19:56] I mean eventlogging [17:19:58] elukey: probably would be find to just put monitor::service things into the defines [17:20:19] yep yep [17:20:23] ya [17:20:27] ha [17:20:28] ls /sbin/status [17:20:28] ls: cannot access '/sbin/status': No such file or directory [17:21:13] for a moment my hearth stopped [17:21:16] now I feel better :D [17:21:23] :D [17:21:24] all right will fix after grooming then [17:40:25] elukey: FYI, i'm half reverting that $reserved_memory_mb thing i did yesterday [17:40:35] and just varying the value of yarn_nodemanager_resource_memory_mb per node by regex [17:40:42] since we really only need to vary that one value [17:44:41] ottomata: ack! [17:46:00] ping joal [17:46:05] ping ottomata [17:46:09] grosking? 
[17:46:29] nuria_: will be there in a minute [17:51:33] 10Analytics, 10Analytics-Wikistats: Beta: Y-axis units and rounding issues - https://phabricator.wikimedia.org/T187429#4035734 (10Nuria) Regarding 1: Localization for y-axis values (for locales supported by numeral.js is done now) +1 to graphs starting at zero [17:52:09] huh, joal i guess we also need yarn.nodemanager.resource.memory-mb on clients [17:52:09] strange! [17:52:24] I don't think ottomata [17:52:42] Now spark fails, but after launch [17:52:53] Probably meaning some nodes don't have the setting yet [17:53:08] I'll wait some time for puppet to have run everywhere [17:53:13] hm, no they all had the setting execpt for the client [17:53:27] now the client has the scheduler one, but not the resource one [17:53:38] 10Analytics, 10Analytics-Wikistats: Use line charts when breaking down a column chart in Wikistats2 - https://phabricator.wikimedia.org/T189200#4035737 (10mforns) [17:53:41] 10Analytics, 10Google-Summer-of-Code (2018): Analytics GSoC - https://phabricator.wikimedia.org/T189210#4035736 (10mforns) [17:53:45] hm [17:53:51] Maybe then - weird [17:54:09] If we give a value for he client, it should be the samllest server one [17:54:13] but that feels weird [17:54:35] yeah, will manually edit real quick to confirm thats the rpoblem [17:54:39] k [17:54:59] ottomata: If the max-mem is set on every node of the cluster, then yeah [17:55:55] hm ok i just edited yarn site on the cluster, same thing [17:55:56] sorry [17:55:58] yar-site on stat1004 [17:56:21] soooo ... IMO the max-value needs to be set to he same value accross the cluster, on all nodes [17:56:26] And on clients [17:56:48] The nodemanager should only be needed on nodemanagers I assume [17:58:07] i would assume too [17:58:31] yarn.nodemanager.resource.memory-mb is set on all workers, and value varies based on RAM. 
this is a no-op since yesterday [17:58:46] yarn.scheduler.maximum-allocation-mb was the same as ^ yesterday [17:58:49] but the change I just made [17:58:59] set it globally for all hadoop nodes, including clients, to yarn.scheduler.maximum-allocation-mb [17:59:00] OH [17:59:01] but [17:59:06] maybe it needs to be set on resourcemanager [17:59:08] i bet that's it... [17:59:12] oh, no [17:59:12] it hsould be [17:59:15] its on all nodes [17:59:18] 10Analytics, 10Analytics-Wikistats: Present Wikistats 2 charts for the period selected by the user. - https://phabricator.wikimedia.org/T183183#4035755 (10mforns) [17:59:22] 10Analytics, 10Analytics-Wikistats: Consistently preserve settings when a user switches to a new metric (especially on the same page). - https://phabricator.wikimedia.org/T183181#4035756 (10mforns) [17:59:22] 10Analytics, 10Analytics-Wikistats: Wikistats Bug – Put view settings in URL so it can be shared - https://phabricator.wikimedia.org/T179444#4035757 (10mforns) [17:59:24] 10Analytics, 10Analytics-Wikistats: Make Wikistats data easily embedable on-wiki - https://phabricator.wikimedia.org/T178016#4035758 (10mforns) [17:59:25] but yarn.nodemanager.resource.memory is only set on workers [17:59:26] 10Analytics, 10Google-Summer-of-Code (2018): Analytics GSoC - https://phabricator.wikimedia.org/T189210#4035754 (10mforns) [17:59:55] hm, but i didn't bounce resourcemanager yesterday at all [18:00:36] That's super unexpected ottomata :( [18:00:51] joal https://yarn.wikimedia.org/conf [18:00:54] yarn.scheduler.maximum-allocation-mb [18:00:54] 8192 [18:00:54] yarn-default.xml will set it globally i guess and see [18:01:18] oh! did elukey reboot namenodes today? [18:01:21] How the heck did it set this way ?? 
[18:01:22] i thought i maybe heard something about that [18:01:33] Yes sir, rebooted this morning I think [18:02:07] 10Analytics-Kanban, 10Google-Summer-of-Code (2018): Analytics GSoC - https://phabricator.wikimedia.org/T189210#4035766 (10fdans) [18:02:50] yes confirmed [18:03:03] ok, ya so it's probably there, the resourcemanager needs puppet and probably a bounce [18:03:07] checking... [18:03:08] yeah [18:03:12] + [18:03:13] + yarn.scheduler.maximum-allocation-mb [18:03:13] + 53248 [18:03:13] + [18:03:13] on rm [18:04:15] I don't understand ottomata :( [18:04:39] joal: so yarn.scheduler.maximum-allocation-mb also needs to be set on the RM [18:04:41] i think [18:04:43] 10Analytics, 10Analytics-Wikistats: Change '--' to something more helpful in Wikistats page views by coutry table view - https://phabricator.wikimedia.org/T187427#4035771 (10mforns) [18:04:50] as well as on clients [18:04:54] Agreed - And yesterday's patch did that? [18:05:02] yesterday's patch removed it [18:05:05] the thing I did just now added it [18:05:08] but we need to bounce RMs [18:05:08] ok - I understand now [18:05:09] doing that now [18:05:16] 10Analytics, 10Analytics-Wikistats: The alert message about adblocker is not fully shown on smaller screens - https://phabricator.wikimedia.org/T188208#4035773 (10mforns) [18:05:31] !log bouncing ResourceManagers [18:05:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:06:59] joal: try now [18:07:05] SUCCESS ! [18:07:09] Thanks a mil ottomata [18:07:24] 10Analytics, 10Analytics-Wikistats, 10Easy: [Wikistats2] The detail page for tops and maps metrics does not indicate time range - https://phabricator.wikimedia.org/T182990#4035789 (10mforns) [18:07:54] 10Analytics, 10Analytics-Dashiki, 10Easy: Add annotationsMetric option to tabs layout - https://phabricator.wikimedia.org/T189159#4035793 (10fdans) [18:07:57] ottomata: thanks a lot! Sorry that I caused this :( [18:08:15] you did not cause! 
[18:08:15] haha [18:08:15] i caused! [18:08:18] bad onfigs [18:08:20] configs* [18:08:42] I haven't reviewed it properly + bounced the rs managers today without checking, also my fault :) [18:08:43] 10Analytics, 10Analytics-Wikistats: Wikistat Beta: expand topic explorer by default - https://phabricator.wikimedia.org/T186335#4035803 (10mforns) [18:08:48] will be more careful next time [18:09:05] Thanks for fast fix anyhow :) [18:10:40] 10Analytics, 10Analytics-Wikistats: Wikistat Beta: expand topic explorer by default - https://phabricator.wikimedia.org/T186335#4035813 (10mforns) a:05Nuria>03None [18:13:04] RECOVERY - Check status of defined EventLogging jobs on eventlog1002 is OK: OK: All defined EventLogging jobs are runnning. [18:17:13] 10Analytics, 10Analytics-Wikistats: Beta: Y-axis units and rounding issues - https://phabricator.wikimedia.org/T187429#4035835 (10mforns) p:05Normal>03Triage [18:19:33] ottomata: I fixed the el alarm with https://gerrit.wikimedia.org/r/#/c/417317/ [18:19:55] a bit hacky but I am planning to do proper alarming after upstart code cleanup [18:23:12] saw it, +1 [18:23:54] 10Analytics, 10Analytics-Kanban: Intervals for data arround pageviews in wikistats maps - https://phabricator.wikimedia.org/T188928#4035850 (10fdans) [18:24:18] 10Analytics: Intervals for data arround pageviews in wikistats maps - https://phabricator.wikimedia.org/T188928#4024359 (10fdans) [18:29:31] ottomata: https://gerrit.wikimedia.org/r/#/c/417322/ - filenames should be different so I am not seeing any issue, wdyt? 
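The YARN incident above boils down to config drift: yarn.scheduler.maximum-allocation-mb was set on workers and clients, but on the ResourceManager it had fallen back to the yarn-default.xml value (8192 instead of 53248, per the paste above). A minimal sketch of the kind of consistency check that would catch this early; the node names and dict-of-configs shape are hypothetical, only the two values come from the log:

```python
def find_config_drift(node_configs, key):
    """Group nodes by the value they hold for `key`.

    Returns {value: [nodes]} when nodes disagree (a missing key shows up
    as value None, i.e. "falls back to yarn-default.xml"), or {} when
    every node agrees."""
    by_value = {}
    for node, conf in node_configs.items():
        by_value.setdefault(conf.get(key), []).append(node)
    return by_value if len(by_value) > 1 else {}

# Hypothetical snapshot mirroring the incident: the RM is missing the
# property and silently falls back to the 8192 default, while workers
# and clients carry 53248.
nodes = {
    "resourcemanager": {},
    "stat1004": {"yarn.scheduler.maximum-allocation-mb": "53248"},
    "analytics-worker": {"yarn.scheduler.maximum-allocation-mb": "53248"},
}
drift = find_config_drift(nodes, "yarn.scheduler.maximum-allocation-mb")
```

In practice the per-node configs could be scraped from each daemon's /conf endpoint (like the yarn.wikimedia.org/conf page checked above), but that fetch layer is left out of the sketch.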
[18:30:16] 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Sanitize Hive EventLogging - https://phabricator.wikimedia.org/T181064#4035876 (10fdans) [18:30:20] 10Analytics: Partially purge user-agent map for EventLogging mobile schemas - https://phabricator.wikimedia.org/T178198#4035878 (10fdans) [18:30:35] 10Analytics: Partially purge user-agent map for EventLogging mobile schemas - https://phabricator.wikimedia.org/T178198#3684282 (10fdans) This will only be implemented in Hadoop [18:31:33] +1 elukey ! [18:32:13] maybe force a manual rsync now with eventlog1001, and then merge the patch [18:32:16] just to be sure [18:32:48] +1 [18:38:01] 10Analytics, 10Analytics-Kanban: Update UA parser - https://phabricator.wikimedia.org/T189230#4035939 (10Nuria) [18:38:16] 10Analytics: Update UA parser - https://phabricator.wikimedia.org/T189230#4035950 (10Nuria) [18:38:18] 10Analytics, 10Google-Summer-of-Code (2018): Analytics GSoC - https://phabricator.wikimedia.org/T189210#4035951 (10Nuria) [18:38:20] 10Analytics: Private geo wiki data in new analytics stack - https://phabricator.wikimedia.org/T176996#4035953 (10Nuria) [18:38:26] 10Analytics, 10Analytics-Wikistats: Wikistats Beta - https://phabricator.wikimedia.org/T186120#4035952 (10Nuria) [18:38:28] 10Analytics, 10Analytics-Wikistats: Make the Wikistats 2 UI responsive - https://phabricator.wikimedia.org/T186812#4035956 (10Nuria) [18:38:30] 10Analytics, 10Patch-For-Review, 10User-Elukey: Move away from jmxtrans in favor of prometheus jmx_exporter - https://phabricator.wikimedia.org/T175344#4035958 (10Nuria) [18:38:32] 10Analytics, 10Patch-For-Review, 10User-Elukey: Move away from jmxtrans in favor of prometheus jmx_exporter - https://phabricator.wikimedia.org/T175344#3590888 (10Nuria) [18:38:34] 10Analytics, 10Analytics-Wikistats: Create Daily & Monthly pageview dump with country data and Visualize on UI - https://phabricator.wikimedia.org/T90759#1066753 (10Nuria) [18:38:39] 10Analytics, 
10Analytics-Wikistats: Create Daily & Monthly pageview dump with country data and Visualize on UI - https://phabricator.wikimedia.org/T90759#4035962 (10Nuria) [18:38:40] 10Analytics: Make aggregate data on editors per country per wiki publicly available - https://phabricator.wikimedia.org/T131280#4035966 (10Nuria) [18:38:42] 10Analytics, 10Analytics-EventLogging: Sunset MySQL data store for eventlogging. Find an alternative query interface for eventlogging on analytics cluster that can replace MariaDB - https://phabricator.wikimedia.org/T159170#4035968 (10Nuria) [18:38:44] 10Analytics, 10Analytics-EventLogging: Sunset MySQL data store for eventlogging. Find an alternative query interface for eventlogging on analytics cluster that can replace MariaDB - https://phabricator.wikimedia.org/T159170#3058941 (10Nuria) [18:38:46] 10Analytics: Per Family Unique Devices Counts - https://phabricator.wikimedia.org/T143927#4035972 (10Nuria) [18:38:48] 10Analytics: Webrequest tagging and distribution. Measuring non-pageview requests - https://phabricator.wikimedia.org/T164019#4035974 (10Nuria) [18:38:52] 10Analytics: Measure Community Backlog. - https://phabricator.wikimedia.org/T155497#4035975 (10Nuria) [18:38:54] 10Analytics: Measure Community Backlog. - https://phabricator.wikimedia.org/T155497#2945157 (10Nuria) [18:38:57] 10Analytics, 10Cloud-Services: Provide mediawiki history data to Cloud Services users - https://phabricator.wikimedia.org/T169572#4035981 (10Nuria) [18:38:59] 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: Provision new Kafka cluster(s) with security features - https://phabricator.wikimedia.org/T152015#4035979 (10Nuria) [18:39:01] 10Analytics: Sanitize pageview_hourly - subtasked {mole} - https://phabricator.wikimedia.org/T114675#4035986 (10Nuria) [18:39:03] 10Analytics, 10Analytics-Wikistats: Wikistats 2.0. - https://phabricator.wikimedia.org/T130256#2131074 (10Nuria) [18:39:05] 10Analytics, 10Analytics-Wikistats: Wikistats 2.0. 
- https://phabricator.wikimedia.org/T130256#4035982 (10Nuria) [18:39:54] 10Analytics-Kanban: Sanitize pageview_hourly - subtasked {mole} - https://phabricator.wikimedia.org/T114675#1702699 (10Nuria) [18:39:57] 10Analytics-Kanban, 10Cloud-Services: Provide mediawiki history data to Cloud Services users - https://phabricator.wikimedia.org/T169572#3402184 (10Nuria) [18:39:59] 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats 2.0. - https://phabricator.wikimedia.org/T130256#2131074 (10Nuria) [18:40:01] 10Analytics-Kanban: Measure Community Backlog. - https://phabricator.wikimedia.org/T155497#2945157 (10Nuria) [18:40:03] 10Analytics-Kanban: Webrequest tagging and distribution. Measuring non-pageview requests - https://phabricator.wikimedia.org/T164019#3218528 (10Nuria) [18:40:05] 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Provision new Kafka cluster(s) with security features - https://phabricator.wikimedia.org/T152015#2835361 (10Nuria) [18:40:07] 10Analytics-Kanban: Per Family Unique Devices Counts - https://phabricator.wikimedia.org/T143927#2583508 (10Nuria) [18:40:09] 10Analytics-EventLogging, 10Analytics-Kanban: Sunset MySQL data store for eventlogging. 
Find an alternative query interface for eventlogging on analytics cluster that can replace MariaDB - https://phabricator.wikimedia.org/T159170#3058941 (10Nuria) [18:40:11] 10Analytics-Kanban, 10Analytics-Wikistats: Create Daily & Monthly pageview dump with country data and Visualize on UI - https://phabricator.wikimedia.org/T90759#1066753 (10Nuria) [18:40:13] 10Analytics-Kanban, 10Analytics-Wikistats: Make the Wikistats 2 UI responsive - https://phabricator.wikimedia.org/T186812#3956038 (10Nuria) [18:40:15] 10Analytics-Kanban: Eventlogging of the Future - https://phabricator.wikimedia.org/T185233#3910345 (10Nuria) [18:40:17] 10Analytics-Kanban: Private geo wiki data in new analytics stack - https://phabricator.wikimedia.org/T176996#3643940 (10Nuria) [18:40:19] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Move away from jmxtrans in favor of prometheus jmx_exporter - https://phabricator.wikimedia.org/T175344#3590888 (10Nuria) [18:40:21] 10Analytics-Kanban: Update UA parser - https://phabricator.wikimedia.org/T189230#4035939 (10Nuria) [18:40:23] 10Analytics-Kanban: Make aggregate data on editors per country per wiki publicly available - https://phabricator.wikimedia.org/T131280#2161972 (10Nuria) [18:40:25] 10Analytics-Kanban, 10Google-Summer-of-Code (2018): Analytics GSoC - https://phabricator.wikimedia.org/T189210#4035243 (10Nuria) [18:40:27] 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Beta - https://phabricator.wikimedia.org/T186120#3934315 (10Nuria) [18:41:16] 10Analytics: Update UA parser - https://phabricator.wikimedia.org/T189230#4036008 (10Nuria) [18:46:03] 10Analytics, 10Analytics-Cluster, 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Move EventLogging analytics processes to Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T183297#4036035 (10Ottomata) [18:46:30] elukey: if you find a sec, review the plan! 
:) https://phabricator.wikimedia.org/T183297 [18:46:35] if we are good, we can do this next week [18:46:37] hopefully [18:48:52] 10Analytics: Some fields in Pivot should be numbers - https://phabricator.wikimedia.org/T167494#4036063 (10Nuria) @Gilles Just FYI that filtering is not going to work fast in druid with numbers, we probably need to test whether it works with our volume. Efficient filtering requires indexes and numeric types are... [18:49:10] 10Analytics: Some fields in webrequest druid dataset should eb ingested as numbers - https://phabricator.wikimedia.org/T167494#4036064 (10Nuria) [18:49:31] 10Analytics-Kanban, 10Patch-For-Review: Remove sensitive fields from whitelist for QuickSurvey schemas (end of Q2) - https://phabricator.wikimedia.org/T174386#4036065 (10mforns) Yes, @leila, @fdans and all I totally forgot to formalize our agreement in this task. Here it goes: * **QuickSurveyInitiation**:... [18:59:05] ottomata: +1! [18:59:17] let's do it mon/tue ? [18:59:49] gtg now but I'll review the patches! [19:00:04] * elukey off! [19:16:54] ya tues les do [19:23:46] is something going on with Kafka? I am getting no events from it despite changes being in Wikidata [19:24:36] it says no events after 18:20 [19:26:02] SMalyshev: looking [19:26:15] revision-create, right? [19:26:51] well, every channel - revision-create, page-properties-change, etc. [19:26:57] mirror maker looking not good [19:26:58] https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?orgId=1&var-instance=main-eqiad_to_jumbo-eqiad [19:27:01] no events at all since 18:20 [19:27:53] yeah I see the graph cuts off at 10:20 which I imagine is the same 18:20 in UTC [19:28:47] looks like some kind of alert is needed there? [19:32:09] indeed [19:32:42] investigating [19:32:46] mirror maker is stuck, not sure why... [19:33:22] it is fine for analytics cluster, why jumbo stuck!? [19:38:16] SMalyshev: we migrated to jumbo just recently [19:38:37] ottomata: disparity of producing versions? 
[19:39:12] nuria_: this thing hasn't changed in a while [19:39:31] i don't totally trust this version of mirror maker, it has been occasionally flaky, one of the reasons i want to upgrade main eqiad [19:39:34] but still not sure yet [19:39:52] just bounced all mirrormakers for this [19:39:59] i think messages are going to start flowing back in [19:40:20] my vps got rebooted, lost all irc logs [19:41:29] ottomata: but I understand that hour of messages was lost? [19:41:45] no nothing lost [19:41:47] just delayed [19:41:55] 10Analytics-Kanban, 10Google-Summer-of-Code (2018): [Analytics] Improvements to Wikistats2 front-end - https://phabricator.wikimedia.org/T189210#4036227 (10srishakatux) [19:41:57] ah ok then [19:42:09] the topics you are consuming from in jumbo are replicated from the 'main' kafka clusters [19:42:13] via Kafka MirrorMaker [19:42:19] which is really just a glorified consumer -> producer [19:42:23] 10Analytics-Kanban, 10Google-Summer-of-Code (2018): [Analytics] Improvements to Wikistats2 front-end - https://phabricator.wikimedia.org/T189210#4035243 (10srishakatux) Project imported here https://www.mediawiki.org/wiki/Google_Summer_of_Code/2018#Improvements_to_Wikistats2_front-end. Feel free to make any ed... [19:42:54] SMalyshev: https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?orgId=1&panelId=5&fullscreen&var-instance=main-eqiad_to_jumbo-eqiad&from=now-1h&to=now [19:43:02] when those go to 0 it will have caught up [19:43:21] yours will probably catch up sooner [19:44:04] ottomata: I am not sure though all the messages were delivered... I get no messages at all from 18:20 to 19:37 and then I get like 500 messages and it jumps forward to 19:37... 
I am worried that 500 msgs is a bit too low count for a whole hour [19:44:31] it's possible, but unlikely hm [19:44:33] RC poller finds much more changes in the last hour [19:44:43] SMalyshev: they may be out of order, and they may not all be caught up [19:44:47] will look into it though [19:44:55] OH [19:44:56] SMalyshev: [19:45:02] the timestamps are server receivetime [19:45:29] are you looking at content timestamps (e.g. meta.dt in the message json) [19:45:36] or from timestamps from kafka? [19:45:41] from kafka... [19:45:50] since that's what is used for offsets [19:45:58] that might be a problem [19:46:06] aye, it is an index. until we get main kafka on 1.0 [19:46:11] we can't make producers set produce timestamp on a message [19:46:16] so the fallback is server receive time [19:46:24] still doesn't explain why I got only 515 messages [19:46:45] no that doesn't you should get them all, so if i look up an offset in revision-create for timestamp 18:20 [19:46:51] and then consume til now [19:46:55] about how many messages would you expect [19:47:02] actually i can answer that myself [19:47:04] much more than 500 :) [19:49:22] ottomata: the poller doesn't use time offsets except on restart [19:49:28] it uses actual kafka poll [19:49:44] aye [19:49:54] but what I got is string of 0 messages, then I guess you did something and I got 515 messages and then again it's nothing [19:50:08] and for that 515 messages the time jumped from 18:20 to 19:37 [19:50:33] so I suspect those 515 were recent ones and the old ones are somehow not delivered [19:51:00] I switched it to RC back for now so I am not sure if it will deliver them eventually... [19:51:23] I'll just wait until it calms down and then switch back to kafka... but we need to find a fix for this [19:51:32] losing an hour of updates is not good [19:51:40] SMalyshev: i think mirror maker is having problems. 
there shouldn't be any dropped messages, but i think it started sending some, and then it stopped again [19:51:52] still investigating, i totally agree [19:52:02] if there are dropped messages i would be very surprised and very upset [19:52:09] can you see the actual message timestamps? [19:52:13] in the json message? [19:52:20] those are created by mediawiki at message produce time [19:52:25] ottomata: ok, I'll wait until you come to some conclusions and then will see. I'll run it on RC for now [19:52:30] k [19:52:51] ottomata: no, I don't have the messages... though I have offsets in the logs so maybe if we look at the stream we could see it [19:54:04] ah dammit I don't have offsets printed in non-debug mode for Kafka, only when starting the poll :( I need to add this [19:54:24] Set topic eqiad.mediawiki.revision-create-0 to (timestamp=1520533230955, offset=122729245) [19:54:34] so that's where it started, but that's all I have [19:56:11] SMalyshev: when i consume starting at that offset i get 144052 messages [19:56:32] there is still something wrong [19:56:33] for sure [19:56:37] but at least more than 500 :) [19:58:50] 10Analytics, 10EventBus: Mediawiki EventBus should set meta.dt to UTC time - https://phabricator.wikimedia.org/T189243#4036334 (10Ottomata) [19:58:56] ottomata: so looks like somehow these messages arrived out of order... [19:59:24] and not just by a second but by a whole hour... [20:00:03] which is not a scenario updater is well suited to deal with [20:00:18] out of order is not unexpected, but even so they don't look that out of order to me [20:00:56] ottomata: so I wonder how then it happened that I got from 18:20 to 19:37 within 515 messages and then had nothing [20:00:59] in usual operation they should not be delayed like this, but it def happens from time to time, it's replication [20:01:39] ottomata: so how much the delay can be? 
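For context on the timestamp discussion above: Kafka's time index maps a target timestamp to the earliest offset whose indexed timestamp is greater than or equal to it. Since the pre-1.0 main cluster can only record server receive time, a seek by timestamp lands wherever messages *arrived*, not where they were produced, which is exactly why late replication confuses the updater's resume logic. A broker-free sketch of that lookup semantics (the index data is made up; a real lookup would go through the consumer client, not a local list):

```python
import bisect

def offset_for_time(index, target_ts):
    """Return the earliest offset whose timestamp >= target_ts, or None
    if every indexed message is older than the target.

    `index` is a sorted list of (timestamp_ms, offset) pairs, analogous
    to a partition's time index."""
    i = bisect.bisect_left(index, (target_ts, -1))
    return index[i][1] if i < len(index) else None

# Made-up index: messages delayed by replication carry late receive
# times, so seeking to a wall-clock moment can skip messages that were
# *produced* earlier but arrived after it.
index = [(1000, 10), (2000, 11), (5000, 12)]
```

For example, a seek to timestamp 1500 resolves to offset 11, silently jumping over anything produced before 1500 that was replicated into offsets 11+.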
[20:01:49] SMalyshev: I don't know how your poller works, but maybe when it ran, replication (mirror maker) wasn't fixed yet and you only got the few messages that were there? [20:02:31] ha, SMalyshev technically it could be delayed as long as the source cluster's retention time, e.g. 7 days [20:02:32] ottomata: well, maybe, I have switched it to RC pretty soon after that, so maybe if I left it running it'd eventually get all the messages... [20:02:34] but that has never happened [20:02:44] the most things have ever been delayed are about what you saw today [20:03:04] ottomata: well, that is not what I can work with for updater :( an hour delay is not good [20:03:08] SMalyshev: we usually do get alerts for this, i don't know why this one didn't happen [20:03:36] SMalyshev: why? because wdqs results will be stale? [20:03:40] if it's a couple of seconds it's ok, but if I don't get updates for an hour or likewise then it's not good as update source [20:03:45] ottomata: yes, exactly [20:04:10] SMalyshev: you could consume from main-eqiad instead of jumbo, but there's no timestamp index there yet [20:04:18] and all functionality relying on wdqs would be using old data [20:04:20] that has messages much sooner, since it isn't double replication [20:04:21] ottomata: which means I can't :) [20:04:30] and if it breaks there, other prod things break too, like change prop, etc. [20:04:47] i don't know why this one broke, but the mirror maker we are using is really old, and using an old API [20:04:49] I can't consume it without timestamp seeking... [20:04:50] and has occasionally been flaky [20:04:59] i want to upgrade main kafkas to 1.x next quarter [20:05:05] then you'd get timestamp seeking there [20:05:09] and we'd be able to use newer mirror maker [20:06:04] but, in any case, in a streaming world you have to be able to deal with delayed events. 
if we get 1.x everywhere, we can set produce time timestamps instead of server receive timestamps for the index you use to offset seek [20:06:13] that would mean seeks would be accurate, even if the messages are delayed [20:06:53] well, depends on how delayed. If occasional update is delayed for a couple of secs, it's one thing, if whole thing shuts off for an hour, it's another [20:07:08] I can deal with the former but not with the latter [20:07:22] SMalyshev: even if they are delayed an hour, the next updater run will do the correct thing, right? and update the indexes properly? [20:07:33] there's not any data problems, just delay? [20:08:07] ottomata: depends... right now the data is messed up because timestamp jumped an hour forward without processing the data there... So I'll have to manually recover it [20:08:28] in general, they should not be delayed this long. but, in the case of networking problems, or other unforeseen stuff, replication can be delayed. this is true for mysql too, from which your RC api poller gets its stuff too [20:08:36] SMalyshev: ? [20:08:44] I'll look into if it's possible to mitigate this but I am not sure how, since there's nothing I can rely on if I can't rely on timestamps [20:08:50] the timestamp you are using to resume from you mean? [20:08:57] ottomata: yes [20:09:22] SMalyshev: you can rely on offsets, no? [20:09:48] i think we talked about this, but now i can't remember, why don't you use offsets to save your state like all other consumers? [20:10:21] because it's not Kafka-specific app. So I'd have to create separate storage for Kafka for this, and keep it up to date with channels, etc. [20:10:36] and of course dump has no offsets, so I'd still have to use timestamps [20:10:52] ? [20:10:55] and RC poller has no offsets, so again I have to use timestamps [20:11:01] oh [20:11:21] SMalyshev: why not just have your consumer commit its latest offset to kafka when it is done updating? 
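The commit-after-update pattern ottomata proposes at the end there gives at-least-once delivery: apply the batch to the store first, commit the offset second, so a crash between the two steps replays the uncommitted messages instead of skipping them. A toy sketch with in-memory stand-ins for both Kafka and the updater's store (the names and shapes are illustrative, not actual wdqs-updater code):

```python
def run_updater(messages, store, committed):
    """Resume from the last committed offset, apply each message to the
    store, and only then record the next offset to resume from.

    Committing *after* applying is what makes this at-least-once: a
    crash between the two steps replays the message on restart rather
    than losing it (the flip order would be at-most-once)."""
    for offset in range(committed.get("offset", 0), len(messages)):
        store.append(messages[offset])     # 1. apply the update
        committed["offset"] = offset + 1   # 2. then commit the offset

# In-memory stand-ins: `store` plays the wdqs database, `committed`
# plays Kafka's committed-offset storage for this consumer group.
store, committed = [], {}
run_updater(["rev1", "rev2", "rev3"], store, committed)
```

A second call with more messages appended picks up exactly where the committed offset left off, which is the "just start from there" behavior discussed above.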
[20:11:34] ottomata: it does that [20:11:45] and when it resumes, why not just start from there? [20:12:00] because when it resumes, there's no concept of "there" [20:12:10] ? [20:12:21] the offset it last read is stored in kafka [20:12:24] you comitted it [20:12:45] I can't really know if this particular client ever talked to Kafka before and if it did, whether all offsets are really correct for all channels [20:13:01] why? can't you use a unique consumer group per consumer? [20:13:12] I have no idea whether I committed it or not [20:13:16] haha, why? [20:13:43] you have an instance of wdqs with a full copy of its db, right? [20:13:54] because again it's not Kafka-only app. Data could be loaded from dump. It could be loaded from RC poller. It could be updated by any other updater we could invent in the next year... [20:14:16] i guess, but you already have some kafka specific stuff in your poller [20:14:20] so I don't know whether Kafka state matches my DB state [20:14:37] why not just say: if I have an offset committed in Kafka, i'll start from there, otherwise i'll use timestamp form db [20:14:41] from* [20:15:12] how do I know whether I have offset committed to Kafka? And whether that offset is good anymore (it could be from before data reload, for example)? [20:15:48] before data reload? oh right because you reload the entire db from time to time [20:15:51] ? 
[20:15:55] yes [20:16:05] (man mirror maker is def messed up right now: https://grafana.wikimedia.org/dashboard/db/kafka-by-topic-prometheus?refresh=5m&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kafka_jumbo&var-kafka_broker=All&var-topic=eqiad.mediawiki.revision-create) [20:16:36] SMalyshev: i'm not totally sure how to do exactly what you want, all I can say is if you want consistent consumption from kafka, offsets will be way better [20:16:41] 10Analytics, 10EventBus, 10Services (doing): Mediawiki EventBus should set meta.dt to UTC time - https://phabricator.wikimedia.org/T189243#4036417 (10mobrovac) p:05Triage>03Normal a:03mobrovac [20:16:58] ottomata: why it's so spiky? it's pretty smooth before, but now it's super-spiky [20:16:59] i guess you could use a new consumer group every time the db is reloaded? [20:17:06] yeah, it should be smooth [20:17:12] mirrormaker processes keep dying and getting restarted [20:17:15] still not sure why yet [20:17:27] the other mirror maker is here [20:17:27] https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=now-3h&to=now&refresh=5m&orgId=1&var-cluster=analytics-eqiad&var-kafka_brokers=All&var-topic=eqiad_mediawiki_revision-create [20:17:34] well [20:17:35] not mirror maker [20:17:37] but kafka cluster [20:17:41] that's the intake for the analytics kafka cluster [20:17:56] the mirror maker process that is producing to jumbo keeps starting up and dying [20:18:20] SMalyshev: could you just use a consumer group name with the database reload timestamp in it? [20:18:23] Hi all! Any thoughts on the last comment here? https://phabricator.wikimedia.org/T181811#4025932 [20:18:40] that way whenever you reload, it would use a new consumer group, and use timestamp to start [20:18:51] because there'd be no committed offset in kafka [20:19:15] I guess probably we need more details on what and when was attempted, and what the error messages were, but offhand... 
can u think of any maintenance that might have been happening around the time? [20:19:18] thanks in advance!!! [20:19:31] ottomata: well, theoretically I could... though this is kinda smelly, it's possible. I think I have dump timestamp. But it's not only fresh reload. What if I temporarily switched to RC poller because kafka was down? [20:21:01] there can be a lot of things that change DB state, and the marker I have is timestamp. But I guess when running within Kafka poller, I could persist offsets [20:21:39] I am not sure though it would help me a lot - as far as I can see now, the problem happens even without restart, just because of how messages arrive [20:22:18] so not sure persisting offsets would do me that much good... but I guess it's possible to add it. [20:22:29] ottomata: was passing by before going out and saw the mirror maker mess, do you need help? [20:22:30] at least there would be some debugging possible with having them. [20:26:48] elukey: mirror maker to jumbo is dying and lagging, not sure why at all right now [20:26:52] dunno what has changed [20:26:55] not sure how you can help tho :) [20:27:19] SMalyshev: it sounds like you want to be able to sync state from many different data sources, which is always hard and always a nightmare [20:27:20] :p [20:27:40] yes, I know :) [20:27:49] that's why I try to keep state simple [20:28:23] but I think I can at least persist the offsets, shouldn't be that hard. If nothing else, that'd help debugging. [20:29:47] elukey: i'm thinking separating the nodes for main-eqiad -> analytics and main-eqiad -> jumbo [20:29:54] using 3 for former and 3 for latter [20:30:05] kafka1012 seems to be always getting assigned all partitions for mirror maker [20:30:10] for both instances [20:30:19] yep seems good [20:30:20] not sure if it will help this problem, but it won't hurt. [20:32:36] AndyRussG: it is hard to understand issue, do you think you could describe it a little better? [20:32:44] AndyRussG: with more detail? 
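One way to sketch the idea floated above, a consumer group name that embeds the database reload timestamp, so a freshly reloaded database finds no committed offsets under its new group and falls back to a timestamp seek. The group naming scheme, prefix, and resume rule here are the proposal under discussion, not an existing wdqs-updater feature:

```python
def consumer_group(reload_ts):
    """One consumer group per database reload: offsets committed before
    a reload belong to the old group, so they can no longer point the
    updater at a position that predates the new database state."""
    return "wdqs-updater-%d" % reload_ts

def resume_point(committed_offsets, group, db_timestamp_ms):
    """Prefer the group's committed offset when one exists; otherwise
    fall back to seeking by the timestamp stored alongside the db."""
    if group in committed_offsets:
        return ("offset", committed_offsets[group])
    return ("timestamp", db_timestamp_ms)

# Stand-in for Kafka's committed-offset storage: only the pre-reload
# group (hypothetical reload timestamp 100) has an offset.
committed = {"wdqs-updater-100": 42}
```

After a reload at (say) timestamp 200, `consumer_group(200)` has no committed offset, so the updater would start from the timestamp in the new database instead of an offset belonging to the old one.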
[20:34:22] ottomata: ok looks like we'll leave it on RC updater till Monday, hopefully mirror maker recovers by then [20:34:34] ok [20:34:36] sorry about this SMalyshev [20:34:39] mirrormaker is worrisome for sure [20:34:47] right now I'm seeing big blank on https://grafana.wikimedia.org/dashboard/db/kafka-by-topic-prometheus?refresh=5m&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kafka_jumbo&var-kafka_broker=All&var-topic=eqiad.mediawiki.revision-create&from=now%2Fd&to=now%2Fd btw [20:35:08] yeah, still trying... [20:35:11] ottomata: do you get any meaningful error somewhere ? [20:35:30] elukey: best i got is org.apache.kafka.clients.producer.internals.ErrorLoggingCallback - Error when sending message to topic eqiad.mediawiki.job.wikibase-addUsagesForPage with key: null, value: 17575 bytes with error: Batch Expired [20:35:41] ok, I'll wait :) [20:36:37] ottomata: I am wondering if the data itself might trigger this issue, maybe a change in the job queues? [20:37:22] elukey: i do see a jump there, that + producing all webrequest this week, kafka jumbo is doing way more [20:37:24] but [20:37:29] strange that analytics is not affected? [20:37:30] dunno [20:37:55] ah so main-eqiad -> analytics is ok? [20:38:15] (do we still have it?) 
[20:39:56] it seems acting weirdly from ~15:30 UTC [20:39:57] https://grafana.wikimedia.org/dashboard/db/kafka-by-topic-prometheus?refresh=5m&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kafka_jumbo&var-kafka_broker=All&var-topic=eqiad.mediawiki.revision-create&from=1520517410883&to=1520553599999&panelId=6&fullscreen [20:41:09] ah lovely we don't have enough logs [20:42:17] ottomata, elukey - Let me know if there is anything you wish me to look at [20:43:21] elukey: main-eqiad -> analytics looked ok to me [20:43:36] that graph is jumbo [20:44:53] ottomata: yes, I was asking indeed if main-eqiad -> analytics was ok for the same topic [20:45:01] oh yes, it seems fine [20:45:05] which is strange [20:46:14] ok elukey, kafka 1012,1013,1014 running analytics mirror, 1020,1022,1023 running jumbo mirror [20:46:41] great [20:49:28] so far better [20:49:52] hm, elukey i do see a spike in eventbus style events today, about the same time this happened [20:50:14] i wonder if replicating all of those to two different places (+ being a kafka broker) was just too much for one node [20:50:28] i would think it wouldn't be, since until this week this node (kafka1012) was also running webrequest [20:51:17] ottomata: maybe the type of events carried was different (eventbus vs webrequest), not merely the volume [20:51:52] maybe. seems unlikely, kafka doesn't really care [20:52:19] spoke too soon elukey still happening [20:52:29] yeah :( [20:56:07] elukey: i think it correlates with adding text to jumbo [20:56:20] ottomata: yeah I was about to say that, ~15:30 [20:56:23] when you deployed [20:56:30] then the spikes started [20:56:31] on tuesday [20:56:32] right? [20:56:33] yeah [20:56:50] 14:30 i think [20:56:52] right? [20:56:54] • 14:36 ottomata: beginning migration of webrequest text varnishkafka logs from Kafka analytics to Kafka jumbo-eqiad https://phabricator.wikimedia.org/T185136 [20:57:01] wait wait [20:57:03] tuesday> [20:57:04] ? 
[20:57:19] nuria_: yeah I'll try to get more details! [20:57:22] thanks :) [20:57:27] yeah tuesday was when we put webrequest_text on jumbo [20:57:40] • 14:36 ottomata: beginning migration of webrequest text varnishkafka logs from Kafka analytics to Kafka jumbo-eqiad https://phabricator.wikimedia.org/T185136 [20:57:44] oh already pasted that :p [20:57:44] I was thinking more today at 15:30 from [20:57:45] https://grafana.wikimedia.org/dashboard/db/kafka-by-topic-prometheus?refresh=5m&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kafka_jumbo&var-kafka_broker=All&var-topic=eqiad.mediawiki.revision-create&from=now-24h&to=now&panelId=6&fullscreen [20:57:45] and elukey [20:57:49] no ? [20:57:54] what graph are you checking? [20:58:02] https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?orgId=1&var-instance=main-eqiad_to_jumbo-eqiad&from=now-7d&to=now&panelId=10&fullscreen [20:58:42] mmmm [20:58:42] what did I deploy today? [20:59:21] https://tools.wmflabs.org/sal/log/AWIGOyBQL7tQ11ghvalx [20:59:22] also, the fact that at that time kafka1013 has some produce requests in flight [20:59:33] ah the avro stuff? [20:59:36] yeah [20:59:39] i guess, that volume is minimal though [20:59:44] but it could be [20:59:50] there are more producers pointing at jumbo as of today [20:59:59] since probably all app servers are now producing to it [21:00:27] but you are right, that correlates too [21:00:27] hm [21:00:33] also, the fact that at that time kafka1013 has some produce requests in flight [21:00:33] ... 
[21:00:42] means to me that kafka1012 MM started flapping then [21:00:45] after the webrequest text deploy [21:00:46] ottomata: take a look to rtt graphs https://grafana.wikimedia.org/dashboard/db/varnishkafka?orgId=1&from=now-24h&to=now [21:01:04] 15:30 again [21:01:06] IINTTERESTING [21:03:10] ottomata: also https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&panelId=13&fullscreen&var-server=kafka-jumbo1001&var-datasource=eqiad%20prometheus%2Fops&from=now-24h&to=now-1m [21:03:32] that makes sense [21:03:35] those are the app servers [21:04:02] elukey: https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&panelId=13&fullscreen&from=now-24h&to=now-1m&var-server=kafka1012&var-datasource=eqiad%20prometheus%2Fops [21:04:14] hmm those don't add up tho, do they! [21:04:25] maybe those were kept alive and these ones are not? [21:04:45] it way more spiky [21:04:49] *is [21:05:10] should we rollback? [21:05:35] i think so, need to bring back mediawiki analytics camus job [21:05:36] on it... [21:05:58] ottomata: need to go now but if you need me call and I'll be back in 20 mins [21:06:14] ok [21:06:17] thanks elukey nice finds [21:06:43] 10Analytics-Kanban, 10Patch-For-Review: Remove sensitive fields from whitelist for QuickSurvey schemas (end of Q2) - https://phabricator.wikimedia.org/T174386#4036473 (10leila) perfect. Thanks, @mforns and team. [21:16:28] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Migrate Mediawiki Monolog Kafka producer to Kafka Jumbo - https://phabricator.wikimedia.org/T188136#4036510 (10Ottomata) Had to revert this. Somehow old Kafka client + Kafka 1.x is causing too many TCP connections, which caused a lo... 
[21:29:04] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Migrate Mediawiki Monolog Kafka producer to Kafka Jumbo - https://phabricator.wikimedia.org/T188136#4036601 (10Ottomata) @elukey, I do think the webrequest_text deploy on Tuesday also correlates: The MirrorMaker instance started fla... [21:31:33] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Migrate Mediawiki Monolog Kafka producer to Kafka Jumbo - https://phabricator.wikimedia.org/T188136#4036618 (10Ottomata) RTT of VK goes up when I deployed webrequest_text to Jumbo too: https://grafana.wikimedia.org/dashboard/db/varni... [21:39:36] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Migrate Mediawiki Monolog Kafka producer to Kafka Jumbo - https://phabricator.wikimedia.org/T188136#4036648 (10Ottomata) Also interesting, it looks like the PHP Kafka client causes a lot of Syn retransmits: https://grafana.wikimedia... [21:40:40] zz.wikipedia.org does not seem to be working in Wikistats v2 - is there a way to get the article count for all Wikipedias in the new Wikistats? [21:53:38] (03PS8) 10Milimetric: [WIP] Compute geowiki statistics for Druid from cu_changes data [analytics/refinery] - 10https://gerrit.wikimedia.org/r/413265 [21:53:59] every time I write an oozie job a little part of me dies :) [21:54:48] milimetric: You understand why I'm barely alive ! [21:54:56] lol joal [21:55:29] ottomata: how is rollback going? Help needed? [21:55:59] varnent: what's zz.wikipedia? Some way to do aggregates? 
The new wikistats just has an "All wikis" project and we don't have all the metrics yet, but we have "new pages": https://stats.wikimedia.org/v2/#/all-projects/contributing/new-pages [21:56:17] varnent: we have a task to make stats available for All *wikipedias* [21:56:22] but that's not done yet [21:57:42] milimetric: Awesome - that would be great as we used the ones from the old portal in Comms regularly - and I just went to update them - it directed me to use zz. via the link here: https://stats.wikimedia.org/EN/TablesWikipediaZZ.htm [21:58:30] Generally all stats that we use in Comms are for either all Wikis - or all languages of a project - it's pretty rare that we use data for English Wikipedia for example :) [21:58:56] milimetric: the most common stats we use: https://office.wikimedia.org/wiki/Communications/Guidelines#Statistics [21:59:04] varnent: thank you for teaching me something about wikistats :) [21:59:06] and how you use it [21:59:14] I'll update the task with this [21:59:31] joal: i rolled back mediawiki avro [21:59:45] Awesome - the Office Wiki page is being moved to Meta-Wiki tomorrow (hence why I'm updating and adding references to where we got them) [21:59:47] def helped [21:59:50] no more lag [21:59:53] CC SMalyshev [21:59:55] https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?orgId=1&var-instance=main-eqiad_to_jumbo-eqiad&from=now-3h&to=now [22:00:21] ok ottomata [22:00:55] ottomata: cave? [22:01:33] 10Analytics, 10Analytics-Wikistats: Wikistats 2.0: allow to view stats for all language versions (a.k.a. Project families) - https://phabricator.wikimedia.org/T188550#4036743 (10Milimetric) Adding some context from @Varnent. Comms uses these kinds of metrics often, see the full details here: https://office.wi... [22:02:56] ottomata: ok, looks good now [22:04:21] ottomata: looks like it went bad ~7:15, and completely broken at 10:20... did you find out what happened?
[22:04:49] (03PS25) 10Mforns: Add EL and whitelist sanitization [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) [22:05:48] joal, ottomata ^ unit tests for scala sanitization are ready, I will now move the parsing/loading of the Whitelist to WhitelistSanitization.scala and do some manual tests with real data, let you know [22:09:37] SMalyshev: not exactly, but we correlated it. [22:09:53] today I deployed a change that pointed the Mediawiki monolog kafka producer at this new jumbo cluster [22:10:47] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&panelId=13&fullscreen&var-server=kafka-jumbo1001&var-datasource=eqiad%20prometheus%2Fops&from=now-24h&to=now-1m [22:11:15] this caused produce request latency to go up all over, and we think we hit some weird combo of old producers hurting each other :p [22:11:24] we might need to upgrade the php kafka client mediawiki uses, it is very old [22:12:18] I see [22:12:37] anyway, I guess it should be safe to put updater back to Kafka now I assume... though we will probably do it on Monday [22:12:56] joal, is it wise to execute scala-spark tests now in the cluster? or will it interfere with your tests in mediawiki-history? [22:13:07] no prob for me mforns :) [22:13:12] they are EL sanitization, should be small [22:13:14] ok [22:13:26] SMalyshev: yeah, after all that i'm not going to point mw back at jumbo til we figure it out and fix that [22:13:31] i don't want to see a spikey graph like that [22:13:44] ottomata: cool, thanks! [22:13:47] I'd still recommend looking into using offsets for resuming when you can [22:13:56] ottomata: I will [22:14:00] cool [22:16:18] ottomata: will you make incident report for this? [22:18:41] ottomata, I think this sanitization thing will be ready to merge today, but...
It still won't work in the cluster, because I have not yet translated the EL TSV whitelist into the extended YAML one, nor ensured its presence in analytics1003 via puppet [22:18:59] were you hoping it would work? [22:23:24] mforns: nono, just want to merge soon, so we can do a refinery release for other things [22:23:28] wasn't going to try to run it yet [22:23:30] and ottomata, re-thinking, maybe the whitelist type-checks should stay in EventLoggingSanitization (at least for now), WhitelistSanitization should be format agnostic no? I mean, should not restrict the whitelist format to YAML. Also, if other jobs use YAML whitelists, we can move the YAML parsing code to a parsing utils file? Anyway, not sure it belongs to WhitelistSanitization.scala... [22:23:35] ok [22:25:06] I'd change the name from loadWhitelist to typeCheckWhitelist? [22:25:47] mforns: hm, i'd just think of it as a possible constructor(?) [22:26:03] you could overload the method with different types [22:26:35] yaml though? i thought you passed it an object? [22:26:53] either way mforns is fine :) [22:26:57] i'm not too picky about this one [22:27:00] yes, WhitelistSanitization receives currently a Map[String, Any] [22:27:41] SMalyshev: wasn't planning on it: ) but i guess i could. [22:27:45] ottomata, the thing is, EL whitelist is not only YAML, but also a special kind of "unstructured" YAML, [22:28:02] it does not have "schema", like: [22:28:06] table: [22:28:09] name: [22:28:13] keep: [22:28:20] - field1 [22:28:24] - field2 [22:28:26] etc. [22:28:54] so, it's kind of a specific logic that I'm not sure belongs to WhitelistSanitization.scala.
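The whitelist shape mforns sketches above (a table name mapping to a `keep` list of fields) is easy to illustrate. This is a hypothetical Python rendering of the idea, not the actual WhitelistSanitization.scala code; the schema, row, and field names are made up:

```python
# Minimal sketch of keep-list sanitization, mirroring the whitelist
# shape discussed above (table -> keep -> fields). All names here are
# hypothetical; the real logic lives in WhitelistSanitization.scala.
whitelist = {
    "SomeSchema": {
        "keep": ["field1", "field2"],
    },
}

def sanitize(table, row, whitelist):
    """Return only the whitelisted fields of a row; unknown tables keep nothing."""
    keep = whitelist.get(table, {}).get("keep", [])
    return {k: v for k, v in row.items() if k in keep}

row = {"field1": "a", "field2": "b", "secret": "c"}
print(sanitize("SomeSchema", row, whitelist))   # {'field1': 'a', 'field2': 'b'}
print(sanitize("OtherSchema", row, whitelist))  # {}
```

This also shows why the format question matters: the function only needs a `Map`-like structure, so YAML parsing can stay outside it, as mforns suggests.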
[22:29:13] If in the future other jobs use the same format, then totally [22:29:41] aye k [22:31:19] (03PS26) 10Mforns: Add EL and whitelist sanitization [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) [22:33:46] related question about Comms stats - I am noticing that some of the stats we got via v1 are different from v2 - even for same months - which do we use? [22:34:19] Example - v1 says total pageviews for Wikipedias in Jan 2018 was 16.02M - but v2 says total pageviews for All Wikis was 15.205M [22:34:51] given that v2 has all Wikis - and the number from v1 is just Wikipedias - you would assume the numbers would be different the other direction - and probably more different than they are [22:35:21] https://stats.wikimedia.org/v2/#/all-projects/reading/total-pageviews and https://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm [22:38:28] ottomata, ok thx :], the code is ready for review in gerrit, passes tests and also tested it with real data. cc joal [22:38:42] ack mforns! Thanks :) [22:40:33] (03PS7) 10Joal: Add by-wiki stats to MediawikiHistory job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/415255 (https://phabricator.wikimedia.org/T155507) [22:44:13] varnent: is data for the same month? [22:44:41] bye team, see you next week! 
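Earlier in the log, SMalyshev recommends using offsets for resuming instead of replaying from scratch. A broker-free sketch of that idea (plain Python with an in-memory "topic"; real code would use a Kafka client's commit/seek API):

```python
# Simulate offset-based resume: a consumer records the last offset it
# processed and, after a restart, continues from that offset instead of
# re-reading the whole log. The "topic" here is just a list.
log = ["msg-%d" % i for i in range(10)]  # messages at offsets 0..9

def consume(log, start_offset, limit):
    """Process up to `limit` messages from start_offset; return (seen, next_offset)."""
    seen = log[start_offset:start_offset + limit]
    return seen, start_offset + len(seen)

# First run processes five messages, then "crashes" after committing.
first, committed = consume(log, 0, 5)
# On restart, resume from the committed offset -- no duplicates, no gaps.
second, _ = consume(log, committed, 5)
print(first + second == log)  # True
```

The point is only the bookkeeping pattern: persist the next offset atomically with the processed output, and a restart picks up exactly where the previous run stopped.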
[22:44:43] varnent: ah, sorry, i see your link [22:46:17] *sorry B instead of M - but yeah - links above [22:47:08] varnent: i see 15,205,348,08 for month of january [22:47:16] varnent: on v2 [22:47:43] nuria_: right - for "All wikis" [22:48:21] and then 16,020 M for Jan 2018 for All languages - Wikipedia [22:49:17] "Page Views for Wikipedia, Both sites, Normalized - All languages (Σ)" [22:50:36] varnent: i think is going to be an issue on how months are labelled as this data comes from the same place [22:50:44] varnent: namely, our hadoop cluster [22:51:02] varnent: in both cases, but regardless one of the labels is wrong [22:51:26] yeah - I'm trying to figure out which is which - lol - so is it 15M+ for all wikis - or 15M+ for all Wikipedias? [22:52:27] varnent: In v2 I read 17b for all projects for Jan2018 [22:52:46] varnent: there is an option of raw data versus normalized data in v1, did you see it? [22:53:30] varnent: look for "Switch to normalized data (for fairer comparison of monthly trends)" [22:53:57] I was looking at normalized - would you suggest raw?
[22:55:27] varnent: the v2 data is raw, yes, so that would be 1 difference [22:55:52] varnent: now, i see one more bug we need to fix with x axis [22:56:22] okay - so yeah - for raw Jan 2018 - I see 16,554 M for "All Wikipedias" in v1 - and for v2 I see 15,205M for "All Wikis" [22:59:31] varnent: i see, let me verify one thing [22:59:51] cool beans - ty - to be clear - I am not like concerned or complaining or anything :) [23:00:02] I am just updating our stats for Comms usage and want to make sure we're being accurate :) [23:00:04] varnent: It seems to me 15.2b is for Feb 2018 - Jan 2018 has 17b I think [23:01:01] the stat we use right now is "Wikipedia is viewed more than 15 billion times every month" as we usually average out a few months and take the big number average [23:01:46] v1 talks about "all Wikipedias" and v2 talks about "All Wikis" - so in part I want to see if we need to update the stat to "Wikimedia sites are viewed more than 15 billion times every month" [23:02:52] joal: v1 says "14,739 M" for Feb 2018 for "All Wikipedias" - so 15.2B for All Wikis for Feb 2018 would make more sense - yeah [23:03:42] joal: I think our labels are wrong on ui [23:04:48] varnent: complaining is ok too [23:05:27] lol - ty :) - although tbh this one is more in the head scratching category :) [23:06:35] based on seeing these stats over time - does that sound about right though? 14.7B for WPs and .5B for all non-WP sites? [23:06:49] or if the Wikipedias have 14.7 should the non-WPs have more than .5B? 
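varnent's closing question is simple arithmetic on the figures quoted above. Assuming the v1 "All Wikipedias" and v2 "All wikis" numbers end up referring to the same month (Feb 2018, per joal's labeling hypothesis), the implied non-Wikipedia share is:

```python
# Figures quoted above, in millions of raw pageviews. Assumes both
# numbers cover the same month once the labeling bug is accounted for.
all_wikis_v2 = 15_205        # v2 "All wikis"
all_wikipedias_v1 = 14_739   # v1 "All Wikipedias", Feb 2018

non_wikipedia = all_wikis_v2 - all_wikipedias_v1
print(non_wikipedia)  # 466 -> roughly 0.5B pageviews for non-Wikipedia sites
```

That matches varnent's back-of-the-envelope split of ~14.7B for Wikipedias and ~0.5B for everything else.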
[23:11:30] 10Analytics-Kanban: Wikistats: labeling of pageviews is wrong on table and graph - https://phabricator.wikimedia.org/T189266#4037011 (10Nuria) [23:12:21] 10Analytics-Kanban: Wikistats: labeling of pageviews is wrong on table and graph - https://phabricator.wikimedia.org/T189266#4037002 (10Nuria) {F14769918} Please @fdans take a look, i think we should fix this before continuing with responsive work, let me know if you disagree [23:12:29] 10Analytics-Kanban: Wikistats: labeling of pageviews is wrong on table and graph - https://phabricator.wikimedia.org/T189266#4037025 (10Nuria) a:03fdans [23:14:17] 10Analytics-Kanban: Wikistats: labeling of pageviews is wrong on table and graph - https://phabricator.wikimedia.org/T189266#4037002 (10Nuria) [23:14:33] varnent: filed ticket, super thanks for letting us know: https://phabricator.wikimedia.org/T189266 [23:15:25] 10Analytics-Kanban: Wikistats: labeling of pageviews is wrong on table and graph - https://phabricator.wikimedia.org/T189266#4037002 (10JAllemandou) Just double checked - Dates are good for me. Must be timezone related. [23:15:49] nuria_: posted a comment - Seems timezone related as I don't have the error [23:16:01] joal: whatatatata [23:16:16] joal: hopefully we are not labelling data with tz dates, ayayayay [23:17:26] nuria_: All our dates are UTC [23:17:38] joal: on api yes [23:17:39] nuria_: I have 2h diffs with UTC, so I see no diff [23:17:54] joal: on UI i bet we are using moment.js [23:17:56] As of now, you have 1 day diff with Utc :) [23:18:03] joal: w/o setting to utc explicitly [23:18:18] joal: i *think* i can fix this [23:18:26] joal: pretty fast if that is the issue [23:18:29] nuria_: I won't even try :) [23:18:40] joal: jajaja cause spark is SO MUCH EASIER [23:18:42] as we know [23:18:47] :) [23:18:48] no biggies there [23:22:54] nuria_: thank you!
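The labeling bug nuria_ and joal discuss above is consistent with formatting UTC-stamped data in the viewer's local timezone (e.g. a date library used without an explicit UTC mode). A Python sketch of the same failure mode, with a made-up timestamp:

```python
from datetime import datetime, timedelta, timezone

# A data point labeled "2018-01-01" in UTC (a monthly bucket boundary).
point = datetime(2018, 1, 1, tzinfo=timezone.utc)

# Rendered in UTC, the label is correct.
print(point.strftime("%Y-%m"))  # 2018-01

# Rendered naively in a UTC-8 viewer's local time, the same instant
# falls in the previous month -- the off-by-one label seen above,
# and why joal (UTC+2) saw no error while a US viewer did.
local = point.astimezone(timezone(timedelta(hours=-8)))
print(local.strftime("%Y-%m"))  # 2017-12
```

The fix nuria_ alludes to is to format such labels explicitly in UTC rather than in the browser's local zone.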
:) [23:48:44] milimetric: I'm waiting for a final memory failure before restarting with more memory - no shuffle issue seen [23:51:37] milimetric: confirmed, relaunching with less executors [23:52:04] joal: ok [23:55:00] Gone to sleep, will check tomorrow morning [23:55:04] bye team