[02:25:28] 10Analytics, 10Patch-For-Review: Productionize navigation vectors - https://phabricator.wikimedia.org/T174796#3708716 (10Shilad)
[08:05:27] * elukey checks joal's "AQS for druid only" patch
[09:07:46] 10Analytics, 10Patch-For-Review, 10User-Elukey: Move away from jmxtrans in favor of prometheus jmx_exporter - https://phabricator.wikimedia.org/T175344#3709072 (10fgiunchedi) @Ottomata the metrics look good generally! A couple of things I noticed: These for example are the same metric name repeated with or...
[10:01:12] 10Analytics, 10Research: geowiki data for Global Innovation Index - 2017 - https://phabricator.wikimedia.org/T178183#3709188 (10leila)
[10:02:48] 10Analytics, 10Research: geowiki data for Global Innovation Index - 2017 - https://phabricator.wikimedia.org/T178183#3683651 (10leila) @Rafaesrey I just checked the query and I can relatively easily help you this year, no problem. I've made a note on my calendar that you want the data in the week of 2018-02-05...
[10:09:34] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 3 others: Set up ChangeProp for JobQueue in beta - https://phabricator.wikimedia.org/T178881#3709232 (10mobrovac)
[10:13:27] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 3 others: [EPIC] Develop a JobQueue backend based on EventBus - https://phabricator.wikimedia.org/T157088#3709237 (10mobrovac)
[10:13:44] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 3 others: [EPIC] Develop a JobQueue backend based on EventBus - https://phabricator.wikimedia.org/T157088#2995179 (10mobrovac)
[10:14:38] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 3 others: [EPIC] Develop a JobQueue backend based on EventBus - https://phabricator.wikimedia.org/T157088#2995179 (10mobrovac)
[10:14:58] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 3 others: [EPIC] Develop a JobQueue backend based on EventBus - https://phabricator.wikimedia.org/T157088#2995179 (10mobrovac)
[10:15:02] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 3 others: Set up ChangeProp for JobQueue in beta - https://phabricator.wikimedia.org/T178881#3709247 (10mobrovac)
[10:16:36] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 3 others: [EPIC] Develop a JobQueue backend based on EventBus - https://phabricator.wikimedia.org/T157088#3709253 (10mobrovac)
[10:16:40] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 2 others: Set up ChangeProp for JobQueue in beta - https://phabricator.wikimedia.org/T178881#3705976 (10mobrovac) 05Open>03Resolved CP4JQ has been set up on `deployment-cpjobqueue`. Resolving.
[10:35:03] 10Analytics, 10EventBus: Malformed HTTP message in EventBus logs - https://phabricator.wikimedia.org/T178983#3709298 (10Pchelolo)
[10:35:31] 10Analytics, 10EventBus, 10Services (next): Malformed HTTP message in EventBus logs - https://phabricator.wikimedia.org/T178983#3709312 (10Pchelolo)
[10:55:18] 10Analytics, 10Research: geowiki data for Global Innovation Index - 2017 - https://phabricator.wikimedia.org/T178183#3709371 (10Rafaesrey) Great, thank you so much. Best, Rafael.
[10:58:42] * elukey lunch!
[12:54:59] a-team: in order to copy files from db1047 (analytics-slave) to db1108 (new db) I'd need (with Manuel) to stop mysql on db1047 for some time (even tomorrow is fine)
[12:55:19] but report updater and possibly other things are relying on that db
[12:55:35] so I am wondering about the best procedure to make the maintenance as smooth as possible
[13:32:44] !log restart yarn nodemanager and hdfs datanode on analytics1030 to apply new JVM settings
[13:32:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:02:42] 10Analytics, 10Patch-For-Review, 10User-Elukey: Move away from jmxtrans in favor of prometheus jmx_exporter - https://phabricator.wikimedia.org/T175344#3709806 (10Ottomata) > These seem to have a label with potentially high cardinality, do you know how node_id changes over time? Ah, K had to look this up. T...
[14:17:18] elukey: you never got an answer. Reportupdater will be ok, not sure what else relies on that server
[14:21:21] milimetric: o/
[14:25:26] generally analytics-slave isn't as popular as analytics-store
[14:28:47] fdans: you around?
[14:28:58] milimetric: hellooo
[14:29:08] cave?
[14:29:12] sure
[14:31:22] elukey: ops sync?
[14:31:38] ottomata: one min sorry :(
[15:00:52] ping elukey
[15:02:10] 10Analytics-Kanban, 10Patch-For-Review: Rename datasources and fields in Druid to use hyphens instead of underscores - https://phabricator.wikimedia.org/T175162#3709969 (10Nuria) a:03JAllemandou
[15:04:27] 10Analytics-Kanban, 10Patch-For-Review: Rename datasources and fields in Druid to use underscores instead of hyphens - https://phabricator.wikimedia.org/T175162#3709977 (10Ottomata)
[15:06:32] 10Analytics-Kanban: Provide oozie job running ClickStream spark job regularly - https://phabricator.wikimedia.org/T175844#3709987 (10Nuria) a:03JAllemandou
[15:21:27] 10Analytics-Kanban, 10Analytics-Wikistats: Handle negative values in charts - https://phabricator.wikimedia.org/T178797#3710147 (10Nuria) a:03Milimetric
[16:01:41] ottomata, joal: we're starting our chat about maven parent pom. if you want to join...
[16:01:56] https://hangouts.google.com/hangouts/_/wikimedia.org/parent-pom
[16:02:06] gehel: finishing other meeting, joining now
[16:29:16] taking a break for dinner a-team, will be back after
[16:51:40] is there any way to find out how much memory a spark executor is actually using? I know about the executor-memory option, and spark.yarn.executor.memoryOverhead, but i'd like to see the actual used memory during runtime. This is because my java code calls C++ via JNI and i'm never completely sure how much memory the JNI actually uses and so just randomly guess at how much memoryOverhead
[16:51:45] it needs (keep increasing till it stops failing)
[16:53:00] ebernhardson: is it in the spark job ui?
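For context on the sizing question above: on YARN, the container limit is executor memory plus the overhead, and when `spark.yarn.executor.memoryOverhead` is unset Spark (1.x/2.x era, which matches this log's timeframe) defaults it to 10% of executor memory with a 384 MB floor. A minimal sketch of that arithmetic (function name is illustrative, the 0.10/384 figures are the documented defaults for those versions):

```python
# Sketch: how YARN sizes a Spark executor container when
# spark.yarn.executor.memoryOverhead is left unset. The 0.10 factor
# and 384 MB floor are the Spark 1.x/2.x documented defaults.
MEMORY_OVERHEAD_FACTOR = 0.10
MEMORY_OVERHEAD_MIN_MB = 384

def container_memory_mb(executor_memory_mb, overhead_mb=None):
    """Total memory YARN enforces for the container, in MB.

    JNI/C++ allocations (malloc, mmap) live in the overhead part:
    if they outgrow it, YARN kills the container, which is why
    guessing the overhead for native code is painful.
    """
    if overhead_mb is None:
        overhead_mb = max(int(executor_memory_mb * MEMORY_OVERHEAD_FACTOR),
                          MEMORY_OVERHEAD_MIN_MB)
    return executor_memory_mb + overhead_mb

# --executor-memory 8g with default overhead: 8192 + 819 = 9011 MB
print(container_memory_mb(8192))
```

So with an 8 GB executor, the native side only gets ~819 MB by default before YARN steps in, which matches the trial-and-error described above.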
[16:53:54] ottomata: not that i can find, it only reports stats about the java side of things
[16:54:16] i would need to know how much memory the kernel thinks has been allocated, since c++ allocates memory outside java's heap
[16:54:27] hm
[16:54:40] for a single process, not sure how that would be reported
[16:54:53] one could log into a node where the executor is running and check the process live
[16:55:05] but we don't report per process mem usage, especially for short lived processes
[16:55:09] afaik
[16:55:44] makes sense. These are actually relatively long running (on the order of a few hours, unless executors are killed and we spin up new ones)
[16:58:01] right, but by short lived i mean non puppetized/daemonized
[16:58:16] :)
[17:28:33] ebernhardson: you can do a dump thread of process and get memory provided you have pid of process and that you can run jstack on the machine, is that crazy idea?
[17:29:34] ebernhardson: heap dump not thread dump
[17:30:05] nuria_: i can't log into the worker machines though
[17:30:16] should i just request access for that? would be the easy-ish solution
[17:30:34] ebernhardson: ah right,
[17:31:02] ebernhardson: mmm.. that does not seem like a super good idea
[17:31:20] ebernhardson: i can give you my ahem .. super not so good workaround that i have used before
[17:31:38] workarounds are plausible
[17:32:33] ebernhardson: sooo.. (alert, extreme craftiness) you can reduce the memory allocated to your process until you get an oom
[17:34:15] nuria_: well, i currently do that in reverse but its pretty time consuming. I start with a small number and keep increasing till it stops blowing up. The problem with this approach is my data set (basically a matrix used for machine learning) changes size when i experiment with new features
[17:34:39] that changing size changes what i need to set my overheads to.
[17:35:26] ebernhardson: i see, as you add more features (rows to matrix, is that so?)
the size of the objects your program holds increases
[17:37:15] ebernhardson: the other thing i can think of is log memory using Runtime object ?
[17:37:50] yes (features are usually columns though, and observations (query/article pair) are rows). Generally i try and guestimate from the total number of points in the dataset, sampling down to less rows if I increase the number of columns. It kinda works but was hoping to find something more concrete
[17:40:45] ebernhardson: seems that spark gotta have some optimization for large matrix multiplications
[17:43:09] the thing is the actual matrix work isn't happening in spark, the matrix gets fed into a c++ library accessed via JNI to do the work. Would probably be way easier if it was implemented in java, but the only java implementation of lambda mart i could find is an academic implementation which doesn't have any way to parallelize between machines. The production level learners are either
[17:43:15] xgboost or lightgbm, both written in C++
[17:44:06] thankfully the jni layer is written upstream so just have to use it, but still have all the regular complications of JNI
[17:44:45] * nuria_ looks what is lambda mart
[17:47:19] ok, ya, no idea but an algorithm for better matching for documents to queries
[17:49:40] yea, it's basically a specific objective for gradient boosted trees for learning to rank
[18:00:40] * elukey off!
[18:12:49] ebernhardson: Have you looked at https://github.com/cloudml/zen/ ?
[18:13:09] There is a lambda mart implementation (just found it using google)
[18:13:21] no clue about how good or bad it would be
[18:14:03] As for memory checking, I have no other idea than logging memory consumption through accumulators
[18:14:09] ebernhardson: --^
[18:16:02] joal: interesting. I can at least give that a shot, JNI is providing never ending annoyances in hadoop
[18:16:20] ebernhardson: usual non-java stuff in java world
[18:16:22] :(
[18:23:30] nuria_: Thanks for your email!
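One concrete way to see "how much memory the kernel thinks has been allocated", as asked above, is to log the process's resident set size from inside the job, since RSS covers JNI/C++ allocations that never show up in JVM heap stats. A minimal Python sketch of the idea (in a Java executor the equivalent would be reading /proc/self/status; logging it through an accumulator, as joal suggests, is an assumption about how you'd ship the numbers back):

```python
# Sketch: ask the kernel for this process's peak resident set size,
# which includes native (C++/JNI) allocations invisible to JVM heap
# metrics. On Linux ru_maxrss is in kilobytes (bytes on macOS).
# A task could log this periodically, e.g. via an accumulator.
import resource

def peak_rss_kb():
    """Peak RSS of the current process, in KB on Linux."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(peak_rss_kb())
```

Comparing this number against executor-memory plus memoryOverhead shows how close the native side is to the YARN container limit.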
[18:23:40] I'm trying to get page content for all the pages in 'File' namespace on Commons and check if they have a description or not
[18:23:52] I know I can use the API, for example: /w/api.php?action=parse&format=json&page=File%3A%20Proboscis_monkey_(Nasalis_larvatus)_composite.jpg&prop=wikitext
[18:23:53] chelsyx: ah i see then api might not work
[18:24:00] but for all pages in 'File' namespace, that may be too many api calls?
[18:24:10] chelsyx: ya, you will get throttled
[18:25:26] nuria_: so my best option is to use hadoop/spark to parse the dump?
[18:25:39] chelsyx: i think it is actually your only option
[18:25:40] chelsyx: Hi :)
[18:25:57] chelsyx: was trying to think of something else you can do
[18:26:08] chelsyx: Please give me a few minutes and I'll check if I can provide you with hadoop-based data
[18:26:09] chelsyx: joal has done that
[18:26:21] chelsyx: for other wikis
[18:26:29] joal nuria_: Thank you!!!
[18:27:05] chelsyx: not sure we have data for commons though, one sec
[18:27:23] chelsyx: but even if we do not, using our prototype code you can probably import yours
[18:27:27] if what you are trying to parse can be represented in a regex, search might be able to pull it out.
[18:28:11] joal: is the parsed data hdfs://analytics-hadoop/wmf/data/wmf/mediawiki/text?
[18:28:22] joal: no wait that is mediawiki
[18:29:12] 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Create Druid public cluster such AQS can query druid public data - https://phabricator.wikimedia.org/T176223#3710799 (10Ottomata)
[18:29:22] ebernhardson: yes. I think that can be represented in a regex
[18:29:32] 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Create Druid public cluster such AQS can query druid public data - https://phabricator.wikimedia.org/T176223#3617909 (10Ottomata) We gotta block on T179027 before we can add ferm rules to allow LVS health checks for druid-public-overlord LVS.
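The action=parse call quoted above can be assembled with the standard library alone; a small sketch (no request is actually sent here, and the function name is illustrative — for all of the File namespace you would still hit the throttling nuria_ warns about, which is why the dumps are the right tool at that scale):

```python
# Sketch: build the action=parse request from the chat with the
# stdlib only; nothing is fetched here. urlencode handles the
# percent-escaping that the quoted URL shows done by hand.
from urllib.parse import urlencode

def parse_api_url(title, base="https://commons.wikimedia.org/w/api.php"):
    params = {
        "action": "parse",
        "format": "json",
        "page": title,
        "prop": "wikitext",
    }
    return base + "?" + urlencode(params)

url = parse_api_url("File:Proboscis_monkey_(Nasalis_larvatus)_composite.jpg")
print(url)
```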
[18:30:13] chelsyx: sorry, joal might know whether we have any data available, i think he moved the place where he is running experiments
[18:30:22] chelsyx: do you just want a count, or?
[18:30:31] also chelsyx, question for you -- Are you interested in historical data or only current one?
[18:31:44] ebernhardson: Yes, I just want a count for now. How will you get it?
[18:32:04] chelsyx: https://commons.wikimedia.org/w/index.php?search=insource%3A%2F%5C%7B%5C%7BInformation+%5C%7C%5BdD%5Description%2F&title=Special:Search&profile=advanced&fulltext=1&ns6=1
[18:32:23] randomly guessing at the regex, but that reports 191055 files matching /\{\{Information \|[dD]escription/
[18:32:57] joal: For now, I'm just interested in the current one, but SDoC team would like to know about the revision history in the future
[18:34:48] ebernhardson: correct me if I'm wrong please ;) chelsyx - I think Search can help you for current revisions, probably not for historical ones
[18:35:25] ebernhardson: That's brilliant! Thank you!
[18:35:29] joa
[18:35:40] right search doesn't have any historical data, although we do make dumps that could be loaded into a testing machine
[18:35:46] chelsyx: I have data for all of commons up until 2017-06 (excluded) - For instance, I have 32 revisions for the Proboscis_monkey_(Nasalis_latvatus)_composite.jpg file
[18:35:48] (but that would probably be pretty slow)
[18:36:20] joal: Thank you! How can I get access to your data?
[18:36:28] chelsyx: issue would be that this is very much a playground, not productionized, therefore not updated regularly
[18:37:16] chelsyx: that means if you are going to use it you probably need to update it yourself
[18:37:28] chelsyx: parquet data: /user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2017-05/wiki_db=commonswiki; in hive: joal.mediawiki_wikitext
[18:37:30] chelsyx: so you have historic + most up to date cc joal, correct?
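The partition spec joal gives above (snapshot and wiki_db must both be pinned so Hive prunes to one wiki's data) can be sketched as the query string one would run; the COUNT itself is illustrative, and the table/partition names are exactly the ones quoted in the chat:

```python
# Sketch: the partition-pruned HiveQL joal describes. Without both
# predicates Hive would scan every snapshot and every wiki_db under
# joal.mediawiki_wikitext; the COUNT is just a placeholder query.
def wikitext_count_query(snapshot="2017-05", wiki_db="commonswiki"):
    """Return a HiveQL string with the required partition predicates."""
    return (
        "SELECT COUNT(*) FROM joal.mediawiki_wikitext "
        f"WHERE snapshot = '{snapshot}' AND wiki_db = '{wiki_db}'"
    )

print(wikitext_count_query())
```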
[18:37:35] joal nuria_: got it
[18:37:54] chelsyx: these are just experiments
[18:37:58] nuria_: not easy to update - easier would be to get a new dump and reparse it
[18:38:13] joal: right, meaning a whole new snapshot
[18:38:29] chelsyx: in hive, need to specify snapshot = '2017-05' and wiki_db = 'commonswiki'
[18:38:39] nuria_: correct
[18:38:50] nuria_: I don't have better ways to do that currently
[18:39:13] joal: ya, i do not think we need those at this time at all
[18:39:23] nuria_: for 2017-05, data took about 2 days to get copied to HDFS, and then another 2 days to get parsed / transformed
[18:39:46] nuria_: thing is to make it productionised and not manual
[18:40:10] anyway, not now :)
[18:41:27] chelsyx: i suppose i should note many more complex regexes will fail when run through the web due to timeouts (they will report partial results and have a yellow box saying that). We can run them from the backend against the failover cluster with excessively large timeouts if needed
[18:41:29] joal nuria_ : Thank you all! I don't think the SDoC team need anything productionised at this moment. We are still exploring. So the playground is good! :)
[18:42:01] good chelsyx - also chelsyx, bear in mind that parsing text is expensive ;)
[18:42:58] ebernhardson: Got it. I don't think I need to run any regex that's more complex than the one you did at this moment ;)
[18:44:46] joal: yeah. Thanks! :)
[18:44:58] np chelsyx, have fun :)
[19:45:32] 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Mirror topics from main Kafka clusters (from main-eqiad) into jumbo-eqiad - https://phabricator.wikimedia.org/T177216#3711160 (10Ottomata) @fgiunchedi Ok, got a different prometheus jmx exporter config file up for Kafka Mirror Maker. (I'm using...
[19:49:03] ottomata: Hey :)
[19:49:43] I'm working on moving around some of the perf team services to newer hardware and/or VMs, as well as multi-dc.
[19:49:56] hiya
[19:50:35] At the moment, we don't store offsets for reasons (statsd). We're looking into bypassing Graphite altogether by using Prometheus, but that's further in the future.
[19:50:53] However I was wondering if it is still possible to ensure data is only processed once, without storing offsets.
[19:51:05] E.g. to have a hafnium-ish VM in both DCs but remain working as currently.
[19:51:55] also curious how Kafka works multi-dc, e.g. are offsets synchronised? Is it good or bad to have both codfw and eqiad nodes in the list of brokers?
[19:52:53] ya that would be bad, doesn't work that way
[19:53:04] so, the multi-dc support is done generally via MirrorMaker
[19:53:10] which is a glorified consumer -> producer
[19:53:24] consume from kafka cluster in DC A, produce to kafka cluster in DC B
[19:53:27] so, offsets are not synced
[19:53:36] but
[19:54:11] can you clarify how you want it to work multi dc?
[19:54:18] I'm open to best practices.
[19:54:29] like, services in codfw -> hafniumish codfw -> statsd.codfw -> graphite codfw?
[19:54:42] (is there a statsd,graphite.codfw?)
[19:54:52] graphite is active/inactive. So incoming data should go to statsd in main DC.
[19:54:59] replicates to codfw
[19:55:06] ok, so ya, eventbus/change prop do this
[19:55:29] so the kafka main cluster is mirrored in both eqiad and codfw with dc prefixed topics
[19:55:30] like
[19:55:43] {eqiad,codfw}.mediawiki.revision-create
[19:55:43] etc.
[19:55:59] producers in eqiad produce to eqiad.* -> main-eqiad kafka cluster
[19:56:04] vice versa for producers in codfw
[19:56:04] then
[19:56:12] mirror maker is configured to consume eqiad* topics from eqiad -> codfw cluster
[19:56:15] and vice versa
[19:56:17] so that way
[19:56:30] Afaik there is a full kafka cluster in eqiad and codfw equally, but I'm not sure how the pipeline is. E.g. do we produce to both? produce to one and replicate to another? produce to one, but which one is determined automatically?
If I understand correctly, the way EventStreams is set up looks quite interesting to me. It seems to be listening to both clusters, but an event is only sent to one of them. So in case of a switch over, it
[19:56:30] works as expected, nicely.
[19:56:33] each main kafka cluster (in eqiad and codfw) has all messages
[19:56:35] but in different topics
[19:56:44] well, not true, in the same topics, but
[19:56:50] for active/inactive MediaWiki
[19:57:01] for the most part only one DC's prefixed topics will receive messages at a given time
[19:57:16] Right.
[19:57:19] ya exactly.
[19:57:33] so your hafniumish vm consumers can be set up in both DCs
[19:57:39] and set to consume from both prefixed topics
[19:57:45] .*.yourtopic
[19:57:47] or
[19:57:52] (eqiad|codfw).yourtopic
[19:57:55] or whatever
[19:58:02] and each produce to the local dc statsd instance
[19:58:04] So that solves the incoming data issue. (Which is already important even in the current set up - right now hafnium only listens to eqiad kafka..)
[19:58:14] then each statsd will have all metrics from both dcs
[19:58:18] doesn't matter which dc is active
[19:58:28] Right, but that's not how it works right now afaik.
[19:58:38] statsd isn't active/active.
[19:58:43] only supposed to produce to the main one.
[19:58:53] And can't have both webperf vms produce to eqiad statsd.
[19:59:00] nono
[19:59:02] so
[19:59:06] the main-eqiad kafka cluster
[19:59:09] will have two topics in it
[19:59:15] eqiad.yourtopic, codfw.yourtopic
[19:59:27] so it will have all metrics from both clusters (thanks to mirror maker)
[19:59:35] your consumer can consume from both topics
[19:59:45] and produce all metrics to statsd.eqiad.wmnet
[19:59:53] Right, because they don't mirror each other, but together they represent everything.
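The `(eqiad|codfw).yourtopic` subscription ottomata suggests boils down to a regex over the DC-prefixed topic names that MirrorMaker maintains in each cluster; kafka-python's `KafkaConsumer.subscribe(pattern=...)` accepts exactly such a pattern. A sketch of the matching itself ("yourtopic" is the chat's placeholder, not a real topic):

```python
import re

# Sketch: the DC-prefixed topic convention from the chat. MirrorMaker
# copies each DC's prefixed topics to the other main cluster, so a
# consumer that wants every message regardless of where it was
# produced subscribes to a pattern over the prefixes, e.g. with
# kafka-python: consumer.subscribe(pattern=r"(eqiad|codfw)\.yourtopic")
TOPIC_PATTERN = re.compile(r"^(eqiad|codfw)\.yourtopic$")

def matching_topics(all_topics):
    """Topics (from cluster metadata) that the subscription would cover."""
    return [t for t in all_topics if TOPIC_PATTERN.match(t)]

topics = ["eqiad.yourtopic", "codfw.yourtopic",
          "eqiad.mediawiki.revision-create"]
print(matching_topics(topics))
```

During a switchover only one prefix carries traffic, but the subscription covers both, so the consumer needs no reconfiguration.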
[19:59:56] the exact same thing can be done in codfw, producing to statsd.codfw.wmnet
[20:00:12] so each DC's statsd instance will eventually get all the same metrics
[20:00:19] while eqiad is active
[20:00:22] the codfw will be mostly empty
[20:00:29] codfw.yourtopic topic*
[20:00:39] Yes, except there is no active statsd in codfw, if there would be, it would likely be producing to eqiad's graphite.
[20:00:47] oh, why?
[20:01:03] oh.
[20:01:05] oh
[20:01:18] so statsd in either DC produces to only the active DC's graphite?
[20:01:50] Yes. It's like MediaWiki. In case of failure, (of either the service or the whole DC, or just for testing) the master is switched.
[20:02:18] hm ok
[20:02:19] Like MySQL masters>slaves, I suppose. Write to the main DC.
[20:02:33] but that means that codfw metrics are produced to eqiad statsd while eqiad is master?
[20:03:23] If we have active services in codfw, then yes their metrics will go to eqiad statsd, as it should be, otherwise nobody would see them.
[20:03:47] aye ok
[20:03:47] hm
[20:04:19] oh right, also. we don't mirror eventlogging topics to codfw...
[20:04:40] If we switch everything, then everything would switch at once, but that would only be the case for disaster. In general, individual services will switch. Some services are already active-active and actively serving traffic. Such as planet.wikimedia's web server is LB'ed to both DCs by geo DNS. If it has statsd metrics, we'll want to see those in our graphite.
[20:04:58] graphite in turn is replicated to multi-DC as cold standby
[20:06:02] When MediaWiki becomes active-active, write requests, logstash and statsd will still go to the "primary" DC, unless we make everything produce directly to both. Which is possible, but that's a separate concern.
[20:06:03] you do need eventlogging topics for your hafniumish services, right?
[20:06:26] hm ya for sure that's what you need
[20:06:27] ok
[20:06:31] hm
[20:06:38] so those aren't even in codfw
[20:06:54] so, you'll have to consume from eqiad even during a datacenter switchover
[20:07:00] if eqiad blows up, you're out of luck
[20:07:03] Right now I'm trying to figure out 1) what current hafnium should really be consuming from, and whether it is prepared to deal with e.g. EventLogging switching to produce from Codfw for example, and 2) if we have webperf VMs ready in both DCs (or two in each DC, even), how should they subscribe.
[20:07:35] "EventLogging switching to produce from Codfw for example" isn't happening
[20:07:37] there is also analytics/statsv, but you'd know better than me how that one works.
[20:07:40] :)
[20:07:52] I hope it's on the backlog. Is there a cold standby provisioned at least?
[20:07:57] no
[20:08:06] not for a while
[20:08:25] because these are all analytics purposes
[20:08:37] they were never budgeted for multi dc availability
[20:08:45] bbl, meeting, sorry
[20:08:47] in case of catastrophic failure
[20:51:39] 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Mirror topics from main Kafka clusters (from main-eqiad) into jumbo-eqiad - https://phabricator.wikimedia.org/T177216#3711276 (10Ottomata) Ah, but I don't know what these ones are or where they are coming from! ``` kafka_consumer_consumer_fetch_ma...
[21:42:34] 10Analytics-EventLogging, 10Analytics-Kanban, 10MW-1.31-release-notes (WMF-deploy-2017-10-10 (1.31.0-wmf.3)), 10Patch-For-Review: PageContenSaveComplete. Stop collecting - https://phabricator.wikimedia.org/T177101#3711379 (10Nuria) Ok, table PageContentSaveComplete_5588433 can be dropped from mySQL, it is...
[21:43:43] 10Analytics-EventLogging, 10Analytics-Kanban, 10MW-1.31-release-notes (WMF-deploy-2017-10-10 (1.31.0-wmf.3)), 10Patch-For-Review: PageContenSaveComplete.
Stop collecting - https://phabricator.wikimedia.org/T177101#3711382 (10Nuria) Docs updated ping @elukey about dropping table in mysql