[04:11:14] (PS2) Milimetric: [VERY WIP] Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556)
[04:11:20] (CR) jerkins-bot: [V: -1] [VERY WIP] Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) (owner: Milimetric)
[04:11:45] lol, I never noticed it's called a "jerkins" bot
[04:11:48] hahaha
[04:12:05] thanks to whoever snuck that past :)
[04:12:09] (probably Antoine)
[07:19:04] Analytics, EventBus, MW-1.31-release-notes (WMF-deploy-2017-11-14 (1.31.0-wmf.8)), Patch-For-Review, Services (next): Timeouts on event delivery to EventBus - https://phabricator.wikimedia.org/T180017#3765627 (Pchelolo) @Ottomata I've reverted your change on kafka1001 as there's some `Attribu...
[08:38:38] hello people
[08:38:59] interesting view of druid's memory https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=druid1001&refresh=1m&orgId=1&panelId=14&fullscreen
[08:39:59] I still need to figure out how much page cache is used by Druid, but I'd say that we should lower heap usage for some Druid daemons
[09:39:13] joal: o/ - I am merging https://puppet-compiler.wmflabs.org/compiler02/8805/aqs1008.eqiad.wmnet/ (localQuorum for AQS)
[09:40:51] elukey: Yes ! following that
[09:41:57] done for aqs1004
[09:42:04] aqs restarted and running localQuorum
[09:42:27] elukey: for aqs1004 or aqs1008?
[09:43:10] 1004, 1008 was only for the puppet compiler (to see what was about to change)
[09:43:15] k
[09:44:16] elukey: looks like several EL schemas may have seen a gap yesterday starting around 9am utc, followed by a spike later: https://phabricator.wikimedia.org/T179914#3764603 https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&from=1510704000000&to=1510790399000&var-schema=Popups ...
[09:44:36] ...
this can't really have to do with the migration of the master db, can it? (https://lists.wikimedia.org/pipermail/analytics/2017-November/006053.html )
[09:45:40] HaeB: hi! it is definitely related to the migration of the master db since due to my paranoia I completely stopped Eventlogging and all the replication
[09:46:23] so the spike later on was when I re-enabled eventlogging
[09:47:00] it should have restarted from its previous kafka offsets, so the gap was only due to maintenance
[09:47:23] oh i see.... TBH i had understood "transparent to all the Event Logging users" as meaning that the data itself won't be affected (just its accessibility)
[09:47:31] elukey: --^ Note for later - makes me think that it's actually better to only stop the MySQL consumer rather than all of EL when feasible
[09:49:11] HaeB,joal - sure, this makes sense, but it was too delicate in my opinion to risk anything. The meaning that I had in mind was that the data collected would not have shown any loss, those graphs are indicating that the EL workers were paused for a bit
[09:49:25] but I should have been more clear, completely agreed
[09:50:39] sure, that's your call! but we need to know when it affects data integrity (that includes timestamps)
[09:51:29] HaeB: I'd be interested to know if the above spike is reflected in timestamps
[09:51:49] HaeB: if timestamps were affected then it is due to my ignorance, my understanding was that they wouldn't have been affected because they are stored in kafka
[09:51:54] joal: you mean in the mysql timestamps?
[09:52:02] HaeB: If it is, it's a strong argument in favour of not stopping EL at large (only MySQL consumers)
[09:52:31] If timestamps are not affected, then the artifact has no impact on data quality
[09:52:57] I'm not fluent enough in EL to recall where in the process timestamps are set
[09:54:57] Agreed, the eventual data analysis would not be affected for those using mysql, but people also go to grafana to check data quality ;) (cf.
phab link above)
[09:56:04] HaeB: you are completely right, I didn't think about this use case, really sorry for the time wasted :(
[09:56:35] HaeB: While I understand that grafana is useful, it reflects what happens in terms of events flowing through the system and has been built for that - The spike is therefore completely right given yesterday's operation
[09:56:58] ...or build whole dashboards on it... the gap seems visible in https://grafana.wikimedia.org/dashboard/db/performance-metrics?refresh=5m&orgId=1
[09:57:11] HaeB: To me, the concern is more about people using tools to check data quality that shouldn't be used for that
[09:58:23] joal: you are also right, but it is true that I didn't account for the gaps in the graphs as a possible side effect of my bold "stop eventlogging completely"
[09:58:40] so my fault for that, it can indeed confuse people
[09:59:54] HaeB: do you want me to comment in the task or will you take care of that?
[10:00:59] i'm happy to do it, but if you are going to add a note to the existing thread on analytics-l, i'll link to that
[10:01:54] Analytics-Kanban, Patch-For-Review: Investigate the use of local_quorum for AQS - https://phabricator.wikimedia.org/T164348#3765953 (elukey)
[10:02:29] HaeB: sure, will do it later on :)
[10:07:48] elukey: I need to triple check our loader to see if local-one can be changed to local-quorum without issue
[10:08:10] joal: let me know if I can help!
[10:08:29] elukey: I recall the loader uses tricks to send the right data to the right nodes - I wonder if the quorum thing isn't maintained in the loader itself
[10:08:53] in the meantime, I can say that the log database on db1108 has been fully sanitized up to 90 days ago!!!!!!
[10:08:56] * joal needs to understand code written by past-joal - This will be hard
[10:09:07] That's awesome :)
[10:09:13] OK, I checked in case of the Popups schema, and the timestamps in the table itself (the Hive version actually) seem fine, i.e.
don't show that gap from Grafana:
[10:09:38] https://www.irccloud.com/pastebin/HNLBpSrs/hourly%20event%20rates%20for%20Popups%20on%202017-11-15
[10:09:50] HaeB: That was what I would have expected - Impact is only on events flowing, and the only non-EL user of the system (performance)
[10:10:17] HaeB: But as elukey stated, we should have sent an email
[10:10:53] well I should have known that first, and now that I am a bit less ignorant I'll avoid the same mistake in the future :)
[10:11:27] That's the thing elukey - breaking stuff is the best (if not almost the only) way to learn about it !
[10:14:22] joal: I am planning to leave aqs1004 running for a bit, and then restart the others after lunch
[10:14:33] Works for me elukey :)
[10:15:15] Oh, by the way almost forgot because she's sleeping: a-team - Naé is sick today and not at the crèche - There'll be moments when I won't be online
[10:15:36] joal: yes, good news! in general i'm still a bit confused about what kind of timestamp grafana uses, in particular because it came up last week in the general discussion about the EL timestamp formats with ottomata https://phabricator.wikimedia.org/T179540#3737504
[10:16:13] HaeB: Grafana doesn't use inside-event timestamps - it uses the machine timestamp (therefore reflects events flowing)
[10:18:03] inside-event in what sense?
[10:18:35] they are all server-side timestamps, no?
[10:18:41] HaeB: inside-event meaning a timestamp stored in the event data, allowing for delay in treatment
[10:19:06] the timestamp is the one that varnishkafka registers when sending the event to kafka
[10:19:30] HaeB: in grafana, the timestamp is server-side generated by kafka - while the timestamp in EL events is generated by VK
[10:20:11] i see
[10:20:44] so if one adds a client-side timestamp (which we are actually doing right now in one case), that's three layers of timestamps ;)
[10:21:31] HaeB: there always are more timestamps than we need ;)
[10:22:43] a while ago phuedx|afk and i were already wondering about discrepancies between grafana event rates and mysql event rates for EL (per minute)
[10:23:58] but that means that the grafana one is always a bit less accurate (subject to further delays compared to the original point in time where the actual event was generated client-side), right?
[10:24:06] HaeB: When you say grafana event rates, I'm assuming you talk about the chart you sent before, right? Because mysql event rates are also shown in grafana IIRC
[10:25:12] yes, the y-axis in https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&from=1510704000000&to=1510790399000&var-schema=Popups
[10:25:24] HaeB: It's nothing to do with accuracy per se - validated-kafka-topic metrics (I think that's the correct name of the chart type we are talking about, that displays the rate of valid events flowing in kafka by topic)
[10:25:41] where does grafana show mysql event rates?
[10:26:15] provide accurate metrics, but are tied to kafka - meaning it shows how many events were received in the chosen topic in kafka over a time-period
[10:26:30] https://grafana.wikimedia.org/dashboard/db/eventlogging?orgId=1
[10:26:39] HaeB: -^
[10:27:02] or that one HaeB: https://grafana.wikimedia.org/dashboard/db/eventlogging?orgId=1&panelId=12&fullscreen
[10:27:13] to only see the one we talk about
[10:27:58] So HaeB, when looking at the grafana charts for validated-topics, you see the rates of events flowing in kafka
[10:28:24] Most of the time, it is representative of the realtime input rate of events
[10:28:49] Except when the EL-validator is stopped (or doesn't process events fast enough)
[10:28:54] makes sense ?
[10:30:20] how do i change https://grafana.wikimedia.org/dashboard/db/eventlogging?orgId=1 to look at yesterday's data? (to check the gap / non-gap around the migration time)
[10:30:49] selecting "Yesterday" results in empty graphs for "Eventlogging by schema" etc https://grafana.wikimedia.org/dashboard/db/eventlogging?orgId=1&from=now-1d%2Fd&to=now-1d%2Fd
[10:32:04] elukey: Were my explanations clear for you as well (who knows the system)?
[10:33:09] joal: yes, makes sense. and obviously "accurate" depends on what you use it for, but there do seem to be people who use it for information about the original timing of events (client-side creation)
[10:33:24] joal: yep
[10:33:26] ...so for that purpose it would be a bit less accurate, i understand
[10:34:23] HaeB: right - the use-case you present is most-of-the-time ok, because the system is fast enough and processes events in pseudo-real-time
[10:34:47] HaeB: But we should let the users know about the inner workings of what it means (as we did today)
[10:36:21] HaeB: You seem to know who uses the system in a non-accurate way - Could you please either let them know of their mistake, and possibly tell them to talk to us if needed?
[10:37:08] HaeB: Something interesting
[10:37:11] HaeB: https://grafana.wikimedia.org/dashboard/db/eventlogging?orgId=1&panelId=12&fullscreen&from=now-1d%2Fd&to=now-1d%2Fd&var-topic=eventlogging_Popups
[10:37:41] On that chart, the spikes represent mysql catching up on events that came late
[10:38:32] joal: well, except that MySQL replication is deactivated for Popups
[10:38:43] (CR) Amire80: [C: 1] "Thank a lot again." (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/365517 (https://phabricator.wikimedia.org/T170764) (owner: Amire80)
[10:38:49] it's replicated in Hive only
[10:40:17] HaeB: Arf, my bad - the mysql insertion rate chart shows all topics - the filter doesn't apply
[10:40:20] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Resolve EventCapsule / MySQL / Hive schema discrepancies - https://phabricator.wikimedia.org/T179625#3766084 (Tbayer) Thanks! This has stopped now (T178500), so feel free to go ahead.
[10:57:15] joal: (who uses...) well, for starters, i understand that the perf team's dashboards like https://grafana.wikimedia.org/dashboard/db/navigation-timing?refresh=5m&orgId=1 are based on the NavigationTiming schema. they do show that gap, so i guess they are based on the less accurate (for that purpose) kafka timestamp
[10:57:29] but that's not my problem or area of expertise ;)
[10:58:53] HaeB: I think you're right: Perf dashboards insert events that come in realtime, and discard not-time-matching events
[10:59:20] But that would require double checking with them :)
[11:07:54] Analytics-EventLogging, Analytics-Kanban: Timestamp format in Hive-refined EventLogging tables is incompatible with MySQL version - https://phabricator.wikimedia.org/T179540#3766156 (Tbayer) >>! In T179540#3741320, @Ottomata wrote: >> Aye! but MySQL users are not the only user of this data! The performan...
[12:09:07] joal: re druid and page cache - from what I can read if druid uses a cache it will be stored in the heap, otherwise it will read from disk and use the page cache
[12:09:29] and https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=druid1001&refresh=1m&orgId=1&panelId=14&fullscreen is interesting
[12:13:37] hm elukey - I don't understand why the chart showing memory is mostly used is so interesting :)
[12:14:00] IIRC we tried to give as much memory to historical and broker as we could (with Andrew)
[12:17:10] (PS2) Fdans: Adds metric schema checking and improves display of number units [analytics/wikistats2] - https://gerrit.wikimedia.org/r/391003 (https://phabricator.wikimedia.org/T178084)
[12:20:13] joal: yep! It is interesting since https://grafana.wikimedia.org/dashboard/db/prometheus-druid shows that we are definitely not using all that ram allocated
[12:20:25] and the page cache is limited due to that
[12:20:44] (the cached memory shown in the graph is only 7/8G)
[12:20:47] Ahhhh ! I get it elukey :) It is indeed interesting :)
[12:20:59] (PS3) Fdans: Adds metric schema checking and improves display of number units [analytics/wikistats2] - https://gerrit.wikimedia.org/r/391003 (https://phabricator.wikimedia.org/T178084)
[12:21:01] :)
[12:21:08] elukey: Need to go to the doctor with Naé - will be back later to discuss
[12:21:12] o/
[12:23:26] * elukey lunch!
[13:10:34] restarted aqs on aqs100[56]
[13:10:44] half of the cluster is running with localQuorum
[13:16:04] can we reboot bohrium to 4.9.51 or is it a bad time?
[13:35:40] moritzm: I think it is fine, we can reboot it
[13:37:40] now?
[13:45:05] moritzm: yep
[13:45:11] (sorry I was reviewing a puppet cr)
[13:46:34] k, going ahead
[13:49:15] bohrium is back up
[13:49:38] nice thanks!
[13:52:55] aqs restart completed
[14:13:53] (PS7) Milimetric: Create Oozie job for interlanguage nav table [analytics/refinery] - https://gerrit.wikimedia.org/r/365517 (https://phabricator.wikimedia.org/T170764) (owner: Amire80)
[14:14:00] (CR) Milimetric: [V: 2 C: 2] Create Oozie job for interlanguage nav table [analytics/refinery] - https://gerrit.wikimedia.org/r/365517 (https://phabricator.wikimedia.org/T170764) (owner: Amire80)
[14:18:19] hm... beeline has gotten better since I used it, accepts things like alt+backspace properly now
[14:20:00] (CR) Fdans: [C: 2] Adds metric schema checking and improves display of number units [analytics/wikistats2] - https://gerrit.wikimedia.org/r/391003 (https://phabricator.wikimedia.org/T178084) (owner: Fdans)
[14:31:18] -nick mforns
[14:31:22] O.o
[14:40:29] mforns: o/ - db1108's log db sanitized up to 90 days ago!
[14:40:50] elukey, \\\o///
[14:40:54] I am writing the puppet patch for the cron
[14:41:07] this is the end of several years of work!
[14:44:18] https://gerrit.wikimedia.org/r/#/c/391828/1/modules/profile/manifests/mariadb/misc/eventlogging/replication.pp
[14:49:35] congrats you two
[14:49:45] and madhuvishy
[14:50:10] this is hard work and the reward is people grumble
[14:58:28] a-team: amazing job on the docs, you are all awesome. I can google (in incognito) "wikitech deploy refinery" or "wikitech anything I forgot about team Analytics" and it's the top hit. So good.
[15:25:48] !log deployed refinery and running interlanguage links dataset now
[15:25:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:30:17] (CR) Mforns: "LGTM in general. Left a couple minor comments." (2 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/383798 (https://phabricator.wikimedia.org/T177491) (owner: Fdans)
[15:31:59] elukey: you around?
[15:32:12] I'm a little confused about the new db cnames
[15:32:52] geowiki and I are looking in vain for the staging.erosen_... tables, which used to be, I think, on analytics-slave or x1-analytics-slave or something-slave
[15:32:55] now they're nowhere to be found
[15:35:08] milimetric: I am, sorry I was brewing coffee :)
[15:35:16] so analytics-slave -> db1108
[15:35:21] never be sorry for coffee, unless it's bad
[15:35:28] *-analytics-slave -> dbstore1002
[15:35:34] +1 cool docs!
[15:35:59] milimetric: but db1108 has a brand new staging db
[15:36:11] db1047 has probably the one that you need
[15:36:12] oh, ok. So what's pointing at whatever *-analytics-slave was pointing at before?
[15:36:24] oh, ok, and nothing points to it
[15:36:25] hm...
[15:36:34] that's not good, means we gotta move those tables maybe
[15:37:06] yep, confirmed elukey, they're on db1047
[15:37:14] sorry I missed that in your plans
[15:38:32] milimetric: nono it is good that we discuss them, I thought to copy the staging db to dbstore1002 and name the db to something like staging_db1047
[15:38:57] but if you think that db1108 should have those tables I'll talk with the dbas to move them over (probably I can do it with a mysql dump)
[15:39:20] oh, elukey that's fine too, a separate db would mean less work and change for everyone
[15:41:09] ok, then, shall we agree on staging_db1047 as the name?
[15:43:32] (PS1) Milimetric: Update database where data is stored [analytics/geowiki] - https://gerrit.wikimedia.org/r/391841
[15:43:50] (CR) jerkins-bot: [V: -1] Update database where data is stored [analytics/geowiki] - https://gerrit.wikimedia.org/r/391841 (owner: Milimetric)
[15:46:49] milimetric: I like it, maybe let's bring it up to the standup?
[15:47:13] ok elukey, will do
[16:00:05] Quarry, Cloud-Services, Community-Wikimetrics, DBA, and 2 others: Evaluate future of wmf puppet module "mysql" - https://phabricator.wikimedia.org/T165625#3766900 (jcrespo)
[16:28:44] Analytics-Kanban: Add documentation for .m suffix code to pagecounts-ez doc page - https://phabricator.wikimedia.org/T180452#3766992 (fdans)
[16:31:03] Analytics-Kanban, Continuous-Integration-Config: Add CI to all analytics/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180301#3767004 (fdans)
[16:54:31] mforns: if you have a minute, I am reviewing the databases to drop on db1047 (reading from https://phabricator.wikimedia.org/T156844#3107412 onwards)
[17:14:40] Analytics, EventBus, MW-1.31-release-notes (WMF-deploy-2017-11-14 (1.31.0-wmf.8)), Patch-For-Review, Services (next): Timeouts on event delivery to EventBus - https://phabricator.wikimedia.org/T180017#3767144 (Ottomata) Ok, good catch, thanks. Will figure that out next week then.
[17:15:20] Analytics, DBA, Operations, Patch-For-Review, User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3767146 (elukey) I am currently reviewing what tables to drop on db1047 and which ones to copy over to db1108, and this is what I gath...
[17:15:23] mforns: summary in https://phabricator.wikimedia.org/T156844#3767146 :)
[17:15:28] also milimetric --^
[17:15:33] (when you guys have finished)
[17:17:29] Analytics, DBA, Operations, Patch-For-Review, User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3767179 (jcrespo) ops are ours, we can handle that- just leave things as you found them. test is probably a mistake and probably shoul...
[17:17:31] Analytics-EventLogging, Analytics-Kanban: Timestamp format in Hive-refined EventLogging tables is incompatible with MySQL version - https://phabricator.wikimedia.org/T179540#3767180 (Ottomata) > [10:19:30] HaeB: in grafana, the timestamp is server-side generated by kafka - while the timestamp in...
[17:20:18] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 5 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3767196 (Pchelolo) Out of the IRC discussion we've got 3 candidates for the next migration: - `wikibase-UpdateUsagesForPag...
[17:47:30] mforns: maps?
[17:48:55] * elukey off!
[17:49:18] fdans, oh! thought it was at 20h
[17:49:20] coming!
[18:42:54] Hi all: I am trying to kick off a spark job running on stat1005 but getting an OOM error.
[18:43:17] Looks like a big R process is running and using 43G of memory.
[18:43:20] Anybody know about this?
[18:43:58] Actually.. sorry... a Python process is using 43G of mem
[18:45:06] sadly notebooks have a tendency to accumulate memory :(
[18:46:16] That's some serious accumulating :)
[18:47:31] dsaez: it looks like your process, any idea?
[18:48:06] iirc dsaez is in europe though, so he might be gone for the day
[18:48:11] let me check
[18:48:27] I'm here ebernhardson
[18:49:01] you wouldn't find me here at 8pm :) but i never know people's work schedules
[18:49:20] hehe
[18:49:41] Thank you ebernhardson and dsaez
[18:49:44] solved
[18:49:47] thanks!
[18:50:16] btw, there is user flemmer+ always using from 16G to 54G
[18:50:34] not sure if G or %
[18:51:53] looks like a notebook as well. I'm not sure if useful, but there is also notebook1001.eqiad.wmnet which has 48G of memory.
not sure if it has all the same access as stat1005
[18:53:00] but it has jupyter already running: https://wikitech.wikimedia.org/wiki/SWAP
[19:08:30] ebernhardson: yes but those machines don't have too much HD space
[19:08:41] nor sshfs to mount from other machines
[19:09:05] nor XML dumps mounted :S
[19:28:40] milimetric: I confirm that a page restored in a non-history-merging way generates a new page_id, even in newer times (tested 2013)
[19:28:47] mforns: --^
[19:29:02] joal, aha
[19:29:27] joal: weird indeed
[19:29:27] hm, should we change the code because of that
[19:29:28] ?
[19:29:42] I guess... it makes things easier?
[19:29:46] mforns: it should involve modifications
[19:30:07] mforns: but globally, we need to revamp page-history rebuilding for delete/restores
[19:30:28] yea
[19:51:27] mforns, milimetric - trying to regenerate reduced data matching wk1
[19:51:37] will have results tomorrow
[19:52:41] good luck joal
[19:54:16] O.o
[20:59:10] (CR) Milimetric: [C: -1] "database name may not actually change after all, waiting to make sure then will abandon" [analytics/geowiki] - https://gerrit.wikimedia.org/r/391841 (owner: Milimetric)
[21:01:29] gone for tonight a-team - see you tomorrow
[21:01:36] byeeeeee
[21:01:38] nite jo
[23:48:13] (CR) Krinkle: "I haven't looked too closely, but I suspect it might be easier to deploy if not the script and the call site in Puppet are required to cha" [analytics/statsv] - https://gerrit.wikimedia.org/r/391703 (https://phabricator.wikimedia.org/T179093) (owner: Ottomata)