[00:16:13] Nettrom: That's what I'm doing, too. It's gross. I think the proper solution would be switching the partition to a string: "YYYY-MM-DD:HH"
[00:16:26] But that is probably a painful change :)
[06:52:18] (03PS1) 10Shilad Sen: WIP: Spark job to create page ids viewed in each session [analytics/refinery/source] (nav-vectors) - 10https://gerrit.wikimedia.org/r/381169 (https://phabricator.wikimedia.org/T174796)
[07:19:36] joal: morning!
[07:19:51] I have a theory for the mess that happened yesterday to the history server
[07:22:32] I checked the metrics and it seems that the heap pressure was already at 80/90% right before the event, and then $something triggered more allocation that ended up causing more GC pressure and work to do minor/major collections
[07:22:55] and eventually GC overhead -> daemon kaput
[07:23:55] after the explanation that you gave us yesterday during standup about what the history server does, I am wondering if huge jobs that allocate a ton of containers (that need to register themselves when they finish, etc.) could cause peaks of heap utilization
[07:24:07] the current setting is Xmx1g, not enough
[07:25:58] we could go to 2g, but since we have a ton of space on an1001 not utilized, I'd go for 4g
[07:26:33] so if my theory is correct, we should be ok when huge jobs kick in and the cluster is already busy
[07:29:54] just merged the puppet change :P
[07:30:02] need to restart the history server though
[08:36:48] Thanks a lot elukey :)
[08:36:53] That is great analysis :)
[08:37:23] I have double checked jobs this morning, everything seems back on track, and with the change to 4G heap for the history server, we should be on the safe side
[08:38:02] super :)
[08:38:19] do you think that we could restart it now? Or maybe let the cluster drain?
[08:43:51] elukey: I'd suspend camus, wait for the drain, then restart
[08:44:06] We've seen that running jobs don't like the history server dying in the middle
[08:47:41] all right let's do it
[08:48:52] Ok !!!
[08:49:02] camus disabled
[08:49:19] elukey: We have a job from bearloga :(
[08:49:35] elukey: It should be small
[08:50:48] elukey: One webrequest load job still to finish
[08:50:55] elukey: You have time for coffee ;)
[08:51:50] there is always time for coffee!
[08:52:00] :D
[09:44:49] hellooooo
[09:49:20] elukey: the jobs from bearloga are automated by report updater, we won't manage to drain the cluster - let's restart the history server now
[09:49:23] Hi mforns :)
[09:51:41] !log restart mapreduce history server on an1001 to apply new heap settings (Xmx/Xms to 4g)
[09:51:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:01:17] joal: all good from my side, ok to re-enable camus?
[10:01:38] elukey: YES !
[10:01:44] Thanks a lot elukey
[10:02:31] !log re-enabled camus after maintenance
[10:02:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:03:09] joal: take a look at the current heap size of the namenodes - https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=4&fullscreen&orgId=1
[10:03:42] no more old gen collections \o/
[10:03:46] https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=4&fullscreen&orgId=1&from=now-7d&to=now
[10:03:58] Looks like something happened yesterday :)
[10:04:03] it is like "yessss more spaceeeee"
[10:04:55] :)
[10:05:05] the datanodes' heap utilization is a bit high and I can see old gen collections, but it is usually like that..
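(For reference, the heap-pressure numbers discussed above can be read straight off a Hadoop daemon's JMX servlet. Below is a minimal sketch, assuming the history server's web UI sits on its default port 19888 and that "an1001" expands to analytics1001.eqiad.wmnet — both are assumptions for illustration, not confirmed by this log.)

```python
# Minimal sketch: read JVM heap usage from a Hadoop daemon's JMX endpoint.
# Host and port are illustrative assumptions (MapReduce history server web
# UI defaults to 19888); adjust for the actual deployment.
import json
import urllib.request

JMX_URL = "http://analytics1001.eqiad.wmnet:19888/jmx?qry=java.lang:type=Memory"

with urllib.request.urlopen(JMX_URL) as response:
    beans = json.load(response)["beans"]

# The java.lang:type=Memory bean exposes HeapMemoryUsage with used/max bytes.
heap = beans[0]["HeapMemoryUsage"]
print("heap used: {:.0%} of max".format(heap["used"] / heap["max"]))
```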
[10:05:11] (4g would be nice in there :P)
[10:05:39] :)
[10:05:48] elukey: What prevents us from doing so?
[10:08:05] joal: theoretically nothing, I am seeing space on the worker nodes (old and new gen) - https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=analytics1060&refresh=1m&orgId=1
[10:08:53] elukey: looking back a month, for that machine, looks like there is enough space to bump it
[10:12:37] good news, the kafka jumbo cluster should have prometheus metrics very soon
[10:20:42] Great elukey :)
[10:20:54] elukey: sorry for bothering you - how are we moving forward with Druid?
[10:21:37] joal: no bother :) - My plan was to re-review one of the huge puppet changes that andrew made to refactor our codebase and allow multiple cluster definitions
[10:21:43] and then merge it after lunch
[10:21:55] after that, there will be another code review to split the clusters
[10:21:57] elukey: looks awesome :)
[10:22:02] but it will require manual work
[10:22:08] ok
[10:22:26] let me know if I can help with the manual work (I'm no good at puppet, but I can move stuff around)
[10:28:05] joal: sure! I have to warn you though that Andrew stated clearly that we'll probably not make it in time for the EOQ
[10:28:30] elukey: I know, I just continue to push to make it fast :)
[10:28:37] even if not fast enough for EOQ
[10:28:47] okok :)
[10:30:32] * elukey lunch!
[12:07:18] * fdans chinese food for lunch!
[12:12:35] Taking a
[12:12:39] break
[12:12:43] !
[12:26:11] (03PS1) 10Addshore: instanceof.php short sleep to avoid rate limit [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/381207 (https://phabricator.wikimedia.org/T176577)
[12:26:29] (03PS1) 10Addshore: instanceof.php short sleep to avoid rate limit [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/381208 (https://phabricator.wikimedia.org/T176577)
[12:26:34] (03CR) 10Addshore: [C: 032] instanceof.php short sleep to avoid rate limit [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/381208 (https://phabricator.wikimedia.org/T176577) (owner: 10Addshore)
[12:26:36] (03CR) 10Addshore: [C: 032] instanceof.php short sleep to avoid rate limit [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/381207 (https://phabricator.wikimedia.org/T176577) (owner: 10Addshore)
[12:26:42] (03Merged) 10jenkins-bot: instanceof.php short sleep to avoid rate limit [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/381208 (https://phabricator.wikimedia.org/T176577) (owner: 10Addshore)
[12:26:45] (03Merged) 10jenkins-bot: instanceof.php short sleep to avoid rate limit [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/381207 (https://phabricator.wikimedia.org/T176577) (owner: 10Addshore)
[13:02:08] joal, about to merge Andrew's refactoring for Druid
[13:22:22] heyall
[13:26:12] helooo
[13:31:20] fyi people I am working on druid1001
[13:31:32] let me know if you are doing anything like restarts etc..
[13:31:33] :)
[13:46:48] 10Analytics, 10Proton, 10Readers-Web-Backlog, 10Patch-For-Review, 10Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3592394 (10pmiazga) @ovasileva - do we need to store also the sessionToken (to detect how many prints we have per session)? It'...
[14:03:04] 10Analytics, 10Proton, 10Readers-Web-Backlog, 10Patch-For-Review, 10Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3592394 (10Jdlrobson) This has been sitting here for a week. To get code review we'll need to be a little more proactive. Have...
[14:52:05] 10Analytics, 10Operations, 10Traffic: Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie - https://phabricator.wikimedia.org/T174640#3643098 (10ema) p:05Triage>03Normal
[14:53:37] (03CR) 10Nuria: WIP: Spark job to create page ids viewed in each session (031 comment) [analytics/refinery/source] (nav-vectors) - 10https://gerrit.wikimedia.org/r/381169 (https://phabricator.wikimedia.org/T174796) (owner: 10Shilad Sen)
[14:53:49] Shilad: I added some comments to your CR
[14:54:07] Shilad: we can talk about it in more detail if they do not make sense
[14:57:54] Shilad: the number of different signatures that you get in a day (per domain) should be on the order of magnitude of unique devices per day per domain, but i think your methodology will return a smaller number
[15:00:17] ping mforns joal milimetric
[15:00:24] hello
[15:00:28] oh!
[15:09:13] hangouts kicked me out again!
[15:23:50] 10Analytics: Add action api counts to graphite-restbase job - https://phabricator.wikimedia.org/T176785#3643192 (10mforns)
[15:24:06] 10Analytics-Kanban: Add action api counts to graphite-restbase job - https://phabricator.wikimedia.org/T176785#3636597 (10mforns)
[15:27:13] 10Analytics, 10Proton, 10Readers-Web-Backlog, 10Patch-For-Review, 10Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3643229 (10mforns)
[15:30:16] 10Analytics-Cluster, 10Analytics-Kanban: CamusPartitionChecker does not work when topic names have '.' or '-' in them. - https://phabricator.wikimedia.org/T171099#3643246 (10mforns)
[15:30:56] 10Analytics-Cluster, 10Analytics-Kanban: CamusPartitionChecker does not work when topic names have '.' or '-' in them. - https://phabricator.wikimedia.org/T171099#3454060 (10mforns)
[15:35:44] 10Analytics, 10EventBus, 10Wikimedia-Stream: Hits from private AbuseFilters aren't in the stream - https://phabricator.wikimedia.org/T175438#3593335 (10mforns) Hi @Nirmos, I'm not super familiar with the AbuseFilters. Can you please give us an explanation of the flow that you are seeing and the one that you'...
[15:36:13] 10Analytics, 10Proton, 10Readers-Web-Backlog, 10Patch-For-Review, 10Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3643274 (10Jdlrobson) @mforns it looks like the fix for T169730 is in production now (it's in 1.30.0-wmf.19 which is everywhere).
[15:36:35] 10Analytics-Kanban, 10Analytics-Wikistats: Stub new mediawiki history-based metrics - https://phabricator.wikimedia.org/T175268#3643276 (10mforns)
[15:37:43] 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats unique devices metrics needs some copy that says "monthly" - https://phabricator.wikimedia.org/T176240#3643284 (10Nuria)
[15:37:55] 10Analytics-Kanban, 10Analytics-Wikistats: Stub new mediawiki history-based metrics - https://phabricator.wikimedia.org/T175268#3588195 (10mforns)
[15:38:43] 10Analytics-Kanban, 10Analytics-Wikistats: Add top articles by pageviews metric - https://phabricator.wikimedia.org/T175266#3643286 (10mforns)
[15:39:04] 10Analytics-Kanban, 10Analytics-Wikistats: Add top articles by pageviews metric - https://phabricator.wikimedia.org/T175266#3588171 (10mforns)
[15:40:39] 10Analytics-Kanban, 10Analytics-Wikistats: Stub new mediawiki history-based metrics - https://phabricator.wikimedia.org/T175268#3643322 (10JAllemandou) URIs to be mocked are defined as swagger-config in Restbase pull request: https://github.com/wikimedia/restbase/pull/875
[15:42:17] 10Analytics, 10Proton, 10Readers-Web-Backlog, 10Patch-For-Review, 10Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3643327 (10bmansurov) >>! In T175395#3642855, @Jdlrobson wrote: > This has been sitting here for a week. To get code review we'...
[15:44:43] 10Analytics: Rename datasources and fields in Druid to use underscores instead of hyphens - https://phabricator.wikimedia.org/T175162#3584677 (10mforns) Let's change banner_activity_minutely to hyphens and that's that.
[15:45:00] 10Analytics: Rename datasources and fields in Druid to use hyphens instead of underscores - https://phabricator.wikimedia.org/T175162#3643336 (10mforns)
[15:45:38] 10Analytics: Rename datasources and fields in Druid to use hyphens instead of underscores - https://phabricator.wikimedia.org/T175162#3584677 (10mforns) p:05Triage>03Normal
[15:45:46] 10Analytics, 10Analytics-Wikistats: Address design feedback from Volker - https://phabricator.wikimedia.org/T167673#3643343 (10Nuria)
[15:46:06] 10Analytics-Kanban: Rename datasources and fields in Druid to use hyphens instead of underscores - https://phabricator.wikimedia.org/T175162#3584677 (10mforns)
[15:48:11] 10Analytics: Productionize streaming jobs - https://phabricator.wikimedia.org/T176983#3643365 (10Nuria)
[15:49:02] 10Analytics-Kanban: Productionize streaming jobs - https://phabricator.wikimedia.org/T176983#3643380 (10Nuria)
[15:50:06] 10Analytics-Kanban: Productionize streaming jobs - https://phabricator.wikimedia.org/T176983#3643365 (10Nuria)
[15:50:21] 10Analytics: R execution on stat1005 -> 'stack smashing error' - https://phabricator.wikimedia.org/T174946#3577730 (10mforns) @Erik_Zachte This is probably because of the new Debian Stretch that stat1005 is running on.
[15:50:31] 10Analytics: Productionize netflow job - https://phabricator.wikimedia.org/T176984#3643389 (10Nuria)
[16:10:00] 10Analytics, 10Operations, 10hardware-requests, 10Patch-For-Review: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#3643514 (10Nuria)
[16:29:58] gone for dinner, will be back after
[16:56:06] nuria: I saw your note about issues constructing session identifiers from IP for mobile sessions. Thanks! Do you know if this also confuses X-Forwarded-For? My understanding is that NAT typically sets this correctly, but perhaps this is not the case for particular OSes/carriers?
[16:56:52] sorry.. in wrong channel... I'll move to wikimedia-analytics, the correct one, and resend.
[16:57:44] ...and i guess i am there. Definitely have not mastered
[16:58:00] going offline people!
[16:58:01] * elukey off!
[17:05:38] 10Analytics, 10Proton, 10Readers-Web-Backlog, 10Patch-For-Review, 10Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3643767 (10mforns) @bmansurov @Jdlrobson Yes, thanks! Will have a look at this tomorrow.
[17:07:18] hey team, leaving for today, tomorrow I'll also start the day earlier, byeee
[17:08:44] 10Analytics-EventLogging, 10Analytics-Kanban, 10Page-Previews, 10Readers-Web-Backlog, and 5 others: EventLogging subscriber module in ready state but not sending tracked events - https://phabricator.wikimedia.org/T175918#3643776 (10phuedx) Per T175918#3632663, this can't be signed off until 21:00 ([[ https...
[17:36:11] 10Analytics-Kanban: vet edit data on the data lake - https://phabricator.wikimedia.org/T153923#3643922 (10Nuria) a:05Milimetric>03ezachte
[17:36:21] 10Analytics-Kanban: vet edit data on the data lake - https://phabricator.wikimedia.org/T153923#2895122 (10Nuria) Assigning to Eric and moving to radar
[17:36:41] 10Analytics: vet edit data on the data lake - https://phabricator.wikimedia.org/T153923#3643924 (10Nuria)
[17:49:55] phuedx: did you get the data you needed for pdf rendering
[17:49:56] ?
[17:51:20] 10Analytics: Private geo wiki data in new analytics stack - https://phabricator.wikimedia.org/T176996#3643940 (10Nuria)
[17:53:21] 10Analytics: Private geo wiki data in new analytics stack - https://phabricator.wikimedia.org/T176996#3643977 (10Nuria)
[18:06:20] nuria_: i did, thanks!
[18:06:35] phuedx: via pivot or command line?
[18:07:36] pivot was great for initial discovery
[18:10:01] phuedx: ok
[18:10:03] and then we used the command line for drilling down into an hour or two's worth of data
[18:10:29] phuedx: ok, ya, that is a more effective use of resources
[18:10:35] phuedx: please evangelize in your team
[18:12:27] nuria_: absolutely!
[18:18:41] Shilad: i just saw your ping, sorry, something is missing on irc today
[18:30:28] Shilad: yt?
[18:33:46] Shilad: but.. what would X-Forwarded-For be set to in that case? you do not really have anything but an umbrella ip
[18:38:08] Shilad: are you thinking of an internal ip, assigned by the operator?
[19:05:02] 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review, 10Readers-Web-Backlog (Tracking): Schema:Popups suddenly stopped logging events in MariaDB, but they are still being sent according to Grafana - https://phabricator.wikimedia.org/T174815#3644260 (10Nuria) @tbayer: I imported popups tables t...
[19:22:48] Shilad: That is not what you will find in the data, it is empty more often than not. We can talk, but I would restrict your queries to desktop if you want the signatures to work as you have them
[19:32:04] Shilad: a good ballpark for signatures for a domain, to cross-check your data, should be the uniques_underestimate from the unique devices calculations; take a look at that data in the tables in the webrequest database: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Unique_Devices
[20:02:34] nuria_: Sorry for not responding. I was teaching, but thanks for the info. That all makes sense. It's a bummer that there is no way to reliably follow uniques on mobile. Have you all thought about using a cookie to do so? I'm sure there are policy issues there... just curious.
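(As an aside on the signature methodology discussed above: it usually amounts to hashing a few request fields together. A minimal sketch, with illustrative field names and a daily salt that are assumptions, not the actual job:)

```python
# Minimal sketch of a per-device "signature": a salted hash of client IP
# and user agent. Field names and the daily salt are illustrative
# assumptions. As discussed in the chat, this undercounts on mobile, where
# carrier NATs put many devices behind one IP and one signature.
import hashlib

def device_signature(client_ip: str, user_agent: str, salt: str) -> str:
    raw = "{}|{}|{}".format(salt, client_ip, user_agent)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Example with placeholder values; a daily salt bounds signatures to one day.
sig = device_signature("203.0.113.7", "Mozilla/5.0 (...)", salt="2017-09-27")
print(sig[:16])
```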
[20:03:30] Shilad: ya, there are huge privacy issues; we do not do it on purpose
[20:04:27] Shilad: for your purpose, since you are looking at signatures over short time spans, you can do it, but it requires a more sophisticated approach than that one
[20:07:48] Hi! I did try to reach druid from a SWAP Jupyter internal notebook, and got a "HTTP Error 503: Service Unavailable"
[20:08:19] Here's how I tried to connect:
[20:08:21] from pydruid.client import *
[20:08:23] query = PyDruid( 'http://druid1001.eqiad.wmnet:8082', 'druid/v2')
[20:09:28] Hmmm grepping around the refinery repo, it says the port is 8090
[20:16:55] Hm still "HTTP Error 503: Service Unavailable"
[20:22:25] However, I'm able to query Druid fine from the console on notebook1001, like so: curl -X POST 'druid1001.eqiad.wmnet:8082/druid/v2/?pretty' -H 'Content-Type:application/json' -d @test_druid_query
[20:22:35] Maybe some Jupyter sandboxing or firewall?
[20:22:48] joal: madhuvishy: ^ thx in advance!! :)
[20:35:34] AndyRussG: I looked at it a bit before and found that I can connect to the port with something like `!nc -vz druid1001.eqiad.wmnet 8082` (tcp) from the jupyter notebook or the server
[20:35:40] but couldn't curl
[20:36:19] I'm not sure what's up, I left messages for andrew and joal a couple days ago in backscroll
[20:37:01] I've seen the messages, but didn't investigate, madhuvishy
[20:53:18] nuria_: That makes sense. Can you tell me more about this idea: "requires a more sophisticated approach than that one." Doesn't sound like something I'll do, but I'm curious.
[20:53:44] nuria_: And sorry for the intermittent delays... I'm holding office hours.
[21:34:09] madhuvishy: joal Thanks so much!! (sorry I was away from the keyboard for a bit)... I guess there's a puppet config for this stuff somewhere, just to see if anything obvious jumps out? thx again
[22:14:19] AndyRussG: say that you get to connect to druid, do you know how to query?
[22:14:50] nuria_: hey... Just looking at the documentation for Druid and pydruid
[22:15:14] Also scraped some config in refinery for some relevant details
[22:15:39] Here's the pydruid query I tried:
[22:15:40] query.timeseries(
[22:15:42] datasource='pageviews-hourly',
[22:15:44] granularity='day',
[22:15:46] intervals = '2017-06-01T00:00/2017-07-01T00',
[22:15:48] aggregations = { 'view_count': doublesum('view_count' ) }
[22:15:50] )
[22:17:32] AndyRussG: and you have tried on the console with pydruid and that works too?
[22:18:30] nuria_: ah no... good idea! I tried with a curl from the console, but not pydruid
[22:18:38] AndyRussG: right
[22:19:10] AndyRussG: let's first try whether pydruid actually works (it might) but i do not think any of us has used it
[22:20:04] K will do!
[22:26:13] With the curl it worked fine from the console, got a valid Druid data response
[22:26:28] AndyRussG: ya, i just tried too
[22:38:40] AndyRussG: trying pydruid from inside the jupyter notebook terminal, it doesn't work; i think the install needs a couple more things
[22:57:31] AndyRussG: i think the dependencies of pydruid require packages that cannot be easily installed on the virtualenv (from my brief tests)
[22:59:27] nuria_: ah hmm interesting...!! thanks! (I'll look at it again in a few, just in a call now)
[23:15:27] AndyRussG: I think pydruid requires pygobject, which requires a system install: https://pygobject.readthedocs.io/en/latest/faq.html
[23:24:54] nuria_: this didn't return any errors from within the notebook: !pip install pydruid
[23:25:07] AndyRussG: right, now try to use the package
[23:26:29] nuria_: mmm it does stuff. For example, the query.timeseries I tried (above) did correctly convert to a valid Druid json query, which it printed out as part of the error
[23:27:53] nuria_: https://tools.wmflabs.org/paste/view/8046eb00
[23:28:41] Apparently it makes an http call and gets a response
[23:31:14] Funny, the query takes a long time to come back
[23:44:18] Hi A Team :D
[23:44:54] general question, how much lag in general can one expect on the mysql event logging tables? how close to real time are they kept?
[23:47:35] nuria_: madhuvishy joal I was able to query Druid using pydruid from the console, using the same virtual python environment we get in the notebooks
[23:47:37] https://tools.wmflabs.org/paste/view/b114d636
[23:48:04] used the packages I'd already installed from within the notebook
[23:48:35] could it be something about how the query is made from the notebook that is blocked by a firewall, or some sort of notebook sandboxing?
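(One way to test the firewall/sandboxing hypothesis from inside the notebook: pydruid, at least in versions from around this time, issues its HTTP calls through Python's urllib, which honors http_proxy/https_proxy environment variables that a plain curl from the terminal may not be using. The sketch below is a guess to rule out, not a confirmed diagnosis; hosts and the datasource name are taken from the chat above.)

```python
# Sketch: check whether notebook requests are being routed through an HTTP
# proxy inherited from the environment -- one possible cause of a 503 for
# internal hosts even though plain curl works. This is an assumption to
# rule out, not a confirmed diagnosis.
import os
import urllib.request

# Non-empty output here (e.g. a webproxy URL) would support the theory.
print(urllib.request.getproxies())

# If a proxy shows up, clear it for this process (best done before the
# first request, since urllib caches its default opener) and retry the
# same query from the chat above.
for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
    os.environ.pop(var, None)

from pydruid.client import PyDruid
from pydruid.utils.aggregators import doublesum

client = PyDruid('http://druid1001.eqiad.wmnet:8082', 'druid/v2')
result = client.timeseries(
    datasource='pageviews-hourly',
    granularity='day',
    intervals='2017-06-01T00:00/2017-07-01T00',
    aggregations={'view_count': doublesum('view_count')},
)
print(result.result)
```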