[07:13:24] hellooo team :]
[08:11:02] hello!!
[08:12:55] mforns[m]: hola!
[08:18:02] all good?
[08:22:31] whenever you are ready to go I'd need to discuss with you T168414 :)
[08:22:32] T168414: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414
[08:22:54] !log restart druid coordinators to pick up new jvm settings (freeing up 6GB of used memory)
[08:22:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:54:41] Morning a-team
[08:55:02] Happy new year analytics folks :)
[08:57:35] o/
[09:13:52] afk for a coffee! (new coworking space, hope that the coffee shop downstairs is good :P)
[09:16:07] Analytics: dbstore1002 possibly MEMORY issues - https://phabricator.wikimedia.org/T183771#3866459 (jcrespo) On a related note, #Analytics team: what is the plan if dbstore1002 fails, now that there is not a db1047 with s1/s2 and people not wanting to use codfw hardware in the past?
[09:27:24] * joal hopes elukey's coffee was good
[09:27:36] /awa/awa/awa
[09:27:40] ufff
[09:27:54] looks like yes :D
[09:28:14] ahahah haven't tried it since a friend brought a ton of coffee from home :D
[09:28:33] we are also testing the internet connection, now it should be good :D
[09:28:40] Ah, I'll have to try that again tomorrow then :)
[09:28:52] tomorrow???
[09:29:07] in 2h probably :D
[09:29:10] hehehe
[09:29:34] I forgot your need for coffee was sensibly equal to mine
[09:29:40] :D
[09:31:25] elukey: from charts, druid and AQS behave correctly, do you agree?
[09:32:48] yep! I found a little visualization/metric bug
[09:32:56] Arf
[09:32:59] that is https://grafana.wikimedia.org/dashboard/db/prometheus-druid?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=druid_analytics&var-druid_datasource=All&panelId=19&fullscreen
[09:33:24] Wow
[09:33:42] but nothing super big, all the coordinators are reporting the available segments (so 3 times the real number)
[09:34:12] hm, not sure what this is actually
[09:39:23] IIRC only the coordinator leader publishes the available segments
[09:40:35] hm, and that would mean we overcount by 3 the total number of segments?
[09:41:08] elukey: Have you taken any action on druid today?
[09:41:24] !log restart druid coordinators to pick up new jvm settings (freeing up 6GB of used memory)
[09:41:28] Oh yes, just saw the log line - coordinator restart
[09:41:33] yessss
[09:41:39] hm, weird
[09:42:46] I think that this happens: when I restart a coordinator, either it or another one becomes the leader, that publishes the metrics
[09:43:27] when another coordinator is restarted, a new election might happen and another one starts publishing the metrics, meanwhile the other one stops completely
[09:43:34] so the last value remains
[09:43:57] also elukey: This chart tells us we are under-optimising segment size for popups schema
[09:44:01] I hear you
[09:44:26] the under optimization part is not clear to me since I am a bit ignorant in the subject :D
[09:46:34] so in this case the solution would be to restart the druid exporter
[09:46:46] that's annoying
[09:46:55] :(
[09:47:09] I confirm the new value is 3*real
[09:47:34] elukey: underoptim --> popups datasource is ~300Mb large
[09:49:00] It is spread over 3400 segments, making an average of 80kb per segment - Very much too small
[09:49:27] a lot of overhead on coordinators maintaining segments for small data volume
[09:52:16] ahhh okok
[09:53:52] elukey: I also think we could give some room to middle managers: the diff between used and max is not big, and we give them super small
[09:54:12] elukey: While it's the opposite on overlord
[09:59:08] Analytics-Kanban, DBA: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414#3866567 (elukey) >>! In T168414#3866296, @Marostegui wrote: > I have not reviewed the full list of ALTERs, you have way more knowledge than me about what is needed on those tables :-) > But yes,...
[10:00:48] joal: yep, a lot of space for tuning.. I am also curious to see results of a re-run of Dan's load test
[10:01:06] elukey: I did last year - No big change
[10:01:44] elukey: I however tried to push it: instead of using 100 parallel workers for requests, I went up to 200, then 500
[10:01:59] With cache warmed, druid handles easily
[10:02:47] elukey: meaning, it doesn't go faster per-request, but handles a lot of requests in parallel
[10:05:43] nice :)
[10:42:43] Analytics: dbstore1002 possibly MEMORY issues - https://phabricator.wikimedia.org/T183771#3866622 (elukey) >>! In T183771#3866459, @jcrespo wrote: > On a related note, #Analytics team: what is the plan if dbstore1002 fails, now that there is not a db1047 with s1/s2 and people not wanting to use codfw hardwar...
[10:45:37] this is not good people --^
[10:45:56] O.o
[10:46:46] mforns: o/ - whenever you have time can you review https://phabricator.wikimedia.org/T168414 ?
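A quick check of the popups sizing quoted at 09:47-09:49 and of the "3*real" overcount confirmed at 09:47:09, as a minimal sketch. The 300 MB and 3400 figures are the approximate ones given in the channel; the decimal conversion (1 MB = 1000 KB), the function names, and the default of 3 reporting coordinators are assumptions made for illustration, not anything taken from the Druid exporter.

```python
def average_segment_size_kb(total_mb: float, segment_count: int) -> float:
    """Average segment size in KB, assuming 1 MB = 1000 KB."""
    return total_mb * 1000 / segment_count


def corrected_segment_count(reported: float, reporting_coordinators: int = 3) -> float:
    """Undo the overcount when every coordinator reports the same segments.

    Assumes each of the coordinators contributes the full real count,
    which is what the "3*real" observation in the log suggests.
    """
    return reported / reporting_coordinators


if __name__ == "__main__":
    avg = average_segment_size_kb(300, 3400)
    print(f"{avg:.1f} KB per segment")  # 88.2 KB per segment
    print(corrected_segment_count(3 * 3400))  # 3400.0
```

The result is in the same ballpark as the ~80kb mentioned in the channel; either way the segments are tiny relative to the datasource, which is the point being made about coordinator overhead.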
[10:46:56] elukey, now :]
[10:47:02] if you are ok with the plan I'll schedule maintenance asap :)
[10:47:46] (PS1) Zhuyifei1999: worker.py: session close take a higher prio, ignore error 2013 in cur close [analytics/quarry/web] - https://gerrit.wikimedia.org/r/401486 (https://phabricator.wikimedia.org/T172143)
[10:48:01] (CR) Zhuyifei1999: [C: 2] worker.py: session close take a higher prio, ignore error 2013 in cur close [analytics/quarry/web] - https://gerrit.wikimedia.org/r/401486 (https://phabricator.wikimedia.org/T172143) (owner: Zhuyifei1999)
[10:48:16] (Merged) jenkins-bot: worker.py: session close take a higher prio, ignore error 2013 in cur close [analytics/quarry/web] - https://gerrit.wikimedia.org/r/401486 (https://phabricator.wikimedia.org/T172143) (owner: Zhuyifei1999)
[11:00:50] elukey, yea, forgot about the alter tables needed, thanks for looking into that!
[11:01:01] commented on task
[11:02:54] \o/
[11:03:24] mforns: I'd sent an email to analytics@ explaining that we'll need to stop the mysql consumer for a couple of days probably, starting tomorrow
[11:03:40] then we'll need to re-run the cleaner again from the start
[11:03:43] and we'll be done
[11:04:28] elukey, oh, it takes a couple days, didn't remember that
[11:05:40] elukey, do you recall how much it took for the old machines?
[11:06:46] it is likely that with the master database being already partially purged, and also given the higher performance of the boxes, this process is faster this time...
[11:07:37] but yea, I guess stopping the consumer for 1 or 2 days is not that bad, given that kafka will buffer everything
[11:10:09] Hey guys, will kill/restart cassandra bundle to take into account new top_bycountry job
[11:10:17] anything against me doing it now?
[11:11:26] elukey, mforns --^ ?
[11:11:49] not on my side :]
[11:12:45] same from me :)
[11:12:56] thanks guy, moving on :)
[11:13:03] s/guy/guys
[11:13:36] mforns: iirc we had to skip some huge tables on the old hosts since their disks were not performing well, so I'd expect a much faster experience this time..
[11:13:39] !log Kill and restart cassandra loading oozie bundle to pick new pageview_top_bycountry job
[11:13:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:29:21] Analytics-Kanban, DBA: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414#3866794 (elukey) Sent an email to announce maintenance for tomorrow (Jan 03).
[11:46:45] * elukey lunch!
[12:04:30] starting a bit late today, hellooooo and happy new year a-team!
[12:08:17] Heya fdans :)
[12:15:18] Analytics, Analytics-Kanban, Analytics-Wikistats: Beta Release: Resiliency, Rollback and Deployment of Data - https://phabricator.wikimedia.org/T177965#3676320 (mforns) I guess T155507 can help in identifying bad runs of mediawiki history reconstruction.
[12:27:40] hey fdans!!!
[12:31:03] ha ha new year!
[12:35:01] mforns: ಹೊಸ ವರ್ಷದ ಶುಭಾಶಯ!
[12:35:22] I'm sure you can read that by now
[12:35:25] xD I only see hexa chars
[12:35:32] goddammit linux
[12:35:36] hehehe
[12:35:49] Analytics, Operations, hardware-requests, ops-eqiad: Decommission db104[67] - https://phabricator.wikimedia.org/T181784#3802296 (MoritzMuehlenhoff) These hosts still shows up in puppetdb, i.e. misses the deactivate step (e.g. visible in https://servermon.wikimedia.org/hosts/)
[12:36:24] https://usercontent.irccloud-cdn.com/file/lhPZqxtq/Screen%20Shot%202018-01-02%20at%2013.36.01.png
[12:36:45] mforns: that's happy new year in Kannada, according to google translate
[12:37:35] fdans, coool! well... I still don't read or speak or understand kannada :[
[12:39:18] Analytics, DC-Ops, Operations, ops-codfw: Decomission eventlog2001 - https://phabricator.wikimedia.org/T182397#3866939 (MoritzMuehlenhoff) Resolved>Open This host still shows up in puppetdb, i.e. misses the deactivate step (e.g. visible in https://servermon.wikimedia.org/hosts/)
[13:05:39] Analytics-EventLogging, Analytics-Kanban: Remove EL capsule from meta and add it to codebase - https://phabricator.wikimedia.org/T179836#3866987 (mforns) a:mforns
[13:07:32] elukey: Do you have that link you sent me the other day about alerting using prometheus metrics?
[13:08:36] joal: should be 'monitoring::check_prometheus' in puppet
[13:08:49] elukey: Tjanks !!
[13:08:52] s/j/h
[13:24:32] Analytics, Operations, ops-eqiad, User-Elukey: Check analytics1037 power supply status - https://phabricator.wikimedia.org/T179192#3867003 (elukey) Tried to check the ipmi command that the icinga check calls: ``` elukey@analytics1037:/var/log$ sudo ipmimonitoring -v | grep -i power 83 | PS Red...
[14:18:52] Analytics-Cluster, Analytics-Kanban, User-Elukey: Upgrade Analytics Cluster to Java 8 - https://phabricator.wikimedia.org/T166248#3867115 (elukey) Some weeks ago we discussed this task during the analytics ops meeting. We shouldn't aim to test refine jobs in labs but only the upgrade procedure itself...
[14:19:49] elukey: --^ ? I thought we said it would be good to test jobs in labS?
[14:20:00] I'm sorry if I don't recall correctly
[14:20:03] :S
[14:20:39] lol i recall the opposite, sorry!
[14:20:57] HELLLOOOOO :D
[14:21:07] i think we said we would test just the refine ones, and not worry about the rest
[14:21:08] hello Andrew! Happy new year :)
[14:21:15] HAPPY NEW YEAARR MY ATEAAAM
[14:21:26] Happy new year mate :)
[14:21:28] ah ok I'll add the note for the refine jobs
[14:22:07] ottomata: when we say refine, we actually mean camus+webrequest-refine, or just webrequest-refine?
[14:23:10] i think it'd be good to test camus, so, and we only need a few records. i have a kafka and varnishkafka setup in analytics labs right now, we can camus into java 8 hadoop cluster there and use that for test
[14:23:36] That would be great ottomata, +1 for that
[14:24:22] all right, so I'll bring up a small cluster in there then
[14:33:48] taking a break a-team, I'll be back for standup
[14:34:29] Happy New Year everyone :)
[14:34:33] hi ottomata!! ! ! ! happy new year :]
[14:35:35] o/
[14:36:43] Hi milimetric ! Happy new year :)
[14:40:31] hey milimetric !!!!! ha ha new year!
[14:41:29] hi mforns :) I saw Last Jedi
[14:41:34] amazing
[14:41:38] xDDD
[14:41:46] you liked?
[14:42:03] POTENTIAL SPOILER ALERT DETECTED
[14:42:14] hehehe
[14:44:04] no spoilers, just that I thought it was so great :)
[14:44:35] i saw too!
[14:44:49] it was only great for the last 1/3
[14:44:56] first 2/3 SUUUCKED
[14:44:57] haha
[14:46:41] hehe
[14:47:02] I liked 1 of the 3 plot lines, the other 2 not really :]
[14:47:39] Analytics-Cluster, Analytics-Kanban, User-Elukey: Upgrade Analytics Cluster to Java 8 - https://phabricator.wikimedia.org/T166248#3867220 (elukey) The java upgrade part should be something like the following executed on all the analytics hadoop hosts: 1) apt-get update && apt-get install openjdk-8-j...
[14:57:48] ottomata: you were really close to the allowed spoiler level :D
[14:58:44] haha
[15:09:55] Analytics-Kanban, Operations, ops-eqiad: dbstore1002 possibly MEMORY issues - https://phabricator.wikimedia.org/T183771#3867285 (Ottomata) > Alternatively, not purchase anything- being fully aware that if it fails, there is no backup service available Most of this data is now available for SQL query...
[15:11:38] * elukey coffee!
[15:12:21] dunndunndundunnnnn! i have successfully marked all holiday email as read and/or actually read them
[15:16:51] :)
[15:17:12] I miraculously had no mail except Luca's from this morning
[15:17:50] milimetric: https://gerrit.wikimedia.org/r/#/c/395917/2 Shall I +1 this?
[15:17:54] sorry
[15:17:56] shall I merge this?
[15:18:00] lookin
[15:18:15] yes
[15:18:26] that would be appreciated, it was the last thing I did before break
[15:18:31] thanks!
[15:19:58] Analytics-Kanban, Patch-For-Review: https://dumps.wikimedia.org/other/pageviews/ needs a README - https://phabricator.wikimedia.org/T167033#3315161 (Milimetric)
[15:24:00] milimetric: done
[15:25:13] yay it works https://dumps.wikimedia.org/other/pageviews/readme.html
[15:35:00] hi, happy new year :D
[15:35:32] ottomata: when you say that dbstore1002's data is available in Hive do you mean mw history reconstruction? (asking to clarify my ideas)
[15:45:50] OH no sorry! shoulda read more there, was assuming yall were talking about eventlogging stuff, i think i was thinking of old db1047
[15:45:54] will correct
[15:46:03] ahhh okok! :)
[15:46:47] Analytics-Kanban, Operations, ops-eqiad: dbstore1002 possibly MEMORY issues - https://phabricator.wikimedia.org/T183771#3867410 (Ottomata) DOH Ignore ^ I for some reason was thinking yall were talking about eventlogging, not MW analytics slave dbs. Carry on!
[15:48:13] Analytics, Analytics-Wikistats: When searching for a project language, display a full list of languages - https://phabricator.wikimedia.org/T182960#3867412 (Milimetric) Agreed that people would have an easier time with a bigger control. Perhaps we can make it the way we want on desktop and then compromi...
[15:56:21] Analytics-Kanban, Patch-For-Review: Druid Woes - https://phabricator.wikimedia.org/T183273#3867427 (elukey) Restarted all the druid coordinators to free more used memory on both clusters, everything looks good. Metrics have been stable and fine up to now, plus Joseph re-ran before the holidays the load t...
[16:00:21] milimetric: Thanks a lot for the readme files ! Dumps navigation looks way better now :)
[16:02:00] Analytics, Operations, ops-eqiad, User-Elukey: Check analytics1037 power supply status - https://phabricator.wikimedia.org/T179192#3867475 (RobH) It seems odd that the hardware says its fine, but the software check doesn't. I'd rather we not close the task if its showing the alarm, but leave it o...
[16:07:29] Analytics, Operations, ops-eqiad, User-Elukey: Check analytics1037 power supply status - https://phabricator.wikimedia.org/T179192#3867494 (elukey) Open>stalled
[16:14:54] Analytics-Cluster, Analytics-Kanban, Operations, Traffic, User-Elukey: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3867544 (Ottomata) > Is restricting protocol to TLSv1.2 explicitly worth any gain on top of the above, given librdkafka has no config for i...
[16:49:48] oo elukey, kafka 1.0 in confluent 4, should have .deb for it!
[16:50:01] might want it before more productionization of jumbo, will look into it
[16:52:31] \o/
[16:53:28] kafka 1.0 :)
[16:53:39] I think I had even never thought of that :)
[16:53:45] elukey: https://grafana-admin.wikimedia.org/dashboard/db/prometheus-druid?panelId=44&fullscreen&orgId=1&from=now-5m&to=now&refresh=1m
[16:53:58] elukey: in edit mode, to see metric definition
[16:56:16] joal: seems good!
[16:56:45] elukey: And I assume we send warnings if less than 10, and error if 0?
[16:57:25] yeah
[17:11:37] (PS1) Addshore: Fix IRC count of wikidata [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/401543 (https://phabricator.wikimedia.org/T165463)
[17:11:42] (PS1) Addshore: Fix twitter count grep [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/401544 (https://phabricator.wikimedia.org/T165463)
[17:11:45] (CR) Addshore: [C: 2] Fix twitter count grep [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/401544 (https://phabricator.wikimedia.org/T165463) (owner: Addshore)
[17:11:50] (CR) Addshore: [C: 2] Fix IRC count of wikidata [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/401543 (https://phabricator.wikimedia.org/T165463) (owner: Addshore)
[17:11:55] (Merged) jenkins-bot: Fix twitter count grep [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/401544 (https://phabricator.wikimedia.org/T165463) (owner: Addshore)
[17:11:59] Analytics, Operations, ops-eqiad: rack/setup/install noteboot100[34] - https://phabricator.wikimedia.org/T183935#3867856 (RobH) p:Triage>Normal
[17:12:01] (Merged) jenkins-bot: Fix IRC count of wikidata [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/401543 (https://phabricator.wikimedia.org/T165463) (owner: Addshore)
[17:16:42] Analytics, Operations, ops-eqiad: rack/setup/install noteboot100[34] - https://phabricator.wikimedia.org/T183935#3867886 (RobH)
[17:25:41] Analytics-Kanban, Operations, ops-eqiad: dbstore1002 possibly MEMORY issues - https://phabricator.wikimedia.org/T183771#3867954 (elukey) I've discussed this task with my team and a couple of things came up: 1) The host is two months from being OOW, so getting a replacement if it breaks might become...
[17:30:54] Analytics, Operations, ops-eqiad: rack/setup/install noteboot100[34] - https://phabricator.wikimedia.org/T183935#3867970 (Ottomata) Let's do Stretch.
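The alert thresholds agreed at 16:56-16:57 (warning below 10, error at 0) can be sketched as a small classifier. This is only an illustration of the threshold logic: the log does not say which metric panel 44 tracks, and the function name, parameter names, and Icinga-style status strings below are assumptions, not the interface of 'monitoring::check_prometheus'.

```python
def classify(value: float, warning_below: float = 10, critical_at: float = 0) -> str:
    """Map a metric value to a check status.

    Hypothetical helper: CRITICAL when the value has dropped to
    critical_at (or lower), WARNING when it is under warning_below,
    OK otherwise.
    """
    if value <= critical_at:
        return "CRITICAL"
    if value < warning_below:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    for sample in (0, 5, 10, 42):
        print(sample, classify(sample))
```

Checking the critical condition first matters because a value of 0 also satisfies "less than 10"; ordering the tests this way keeps the more severe status from being masked by the warning.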
[17:30:59] Analytics, Operations, ops-eqiad: rack/setup/install noteboot100[34] - https://phabricator.wikimedia.org/T183935#3867971 (Ottomata)
[17:31:08] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Alert user about adblocker preventing AQS requests - https://phabricator.wikimedia.org/T177491#3867973 (Nuria)
[17:31:10] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Review the alert message about adblocker preventing AQS requests - https://phabricator.wikimedia.org/T182958#3867972 (Nuria) Open>Resolved
[17:31:28] Analytics-Kanban, Documentation, Patch-For-Review: Document Dashiki - https://phabricator.wikimedia.org/T182477#3867975 (Nuria) Open>Resolved
[17:47:55] * elukey off!
[18:38:05] Analytics-Cluster, Analytics-Kanban, Operations, Traffic, User-Elukey: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3868327 (MoritzMuehlenhoff) >>! In T182993#3867544, @Ottomata wrote: > K! Kafka [[ https://github.com/apache/kafka/blob/trunk/clients/src/...
[18:45:06] Analytics-Kanban: Document mediawiki history reduced table - https://phabricator.wikimedia.org/T183951#3868370 (Nuria)
[18:45:50] Analytics-Kanban: Document mediawiki history reduced table - https://phabricator.wikimedia.org/T183951#3868381 (Nuria) a:JAllemandou
[19:10:57] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3868701 (Pchelolo) I've looked over all the logs and graphs we've acquired over the past several weeks and found no indications of issues. This makes...
[19:41:13] Pchelolo: hEYYY new ksql works better!
[19:41:36] stat1005:/home/otto/ksql/bin/ksql-cli --bootstrap-server kafka-jumbo1002.eqiad.wmnet:9092
[19:41:44] hi ottomata happy NY :)
[19:41:48] happy nY!
[19:41:55] i've not played with it much, but i've at least created a stream and selected some values from it
[19:41:56] I can build a newer version
[19:42:00] i just built from master
[19:42:06] oh ok cool
[19:42:09] you can use it out of my home dir there ^^^ if you wanna try
[19:42:12] v0.3 didn't build
[19:42:18] kk
[19:42:45] thank you for letting me know
[19:48:11] :)
[19:50:02] It's out: https://blog.wikimedia.org/2018/01/02/wikistats-2/
[19:50:07] Analytics, Analytics-Cluster: Requesting account expiration extension - https://phabricator.wikimedia.org/T183291#3868934 (Slaporte) We are discussing the data publication over email, and Justin is still helping resolve a bug in the report that started on December 1. There is still a need for data access...
[20:14:12] Analytics-Data-Quality, Analytics-Kanban, Datasets-Webstatscollector, Language-Team, and 5 others: Investigate anomalous views to pages with replacement characters - https://phabricator.wikimedia.org/T117945#3869067 (Nuria) Agreed @mforns
[20:21:28] :D
[20:28:13] Analytics: Publishing project anomaly data for censorship researchers. Evaluate privacy threats - https://phabricator.wikimedia.org/T183990#3869129 (Nuria)
[20:59:14] Gone for tonight guys :)
[21:14:42] Analytics-Cluster, Analytics-Kanban, Patch-For-Review, User-Elukey: Understand Kafka ACLs and figure out what ACLs we want for production topics - https://phabricator.wikimedia.org/T167304#3869321 (Ottomata) Today I also found that we needed kafka acls --add --deny-principal User:ANONYMOUS --...
[21:32:17] Analytics: continue to improve computation for pages, deletion/restores - https://phabricator.wikimedia.org/T183975#3869408 (Aklapper)
[21:52:23] Analytics, Analytics-Cluster: Requesting account expiration extension - https://phabricator.wikimedia.org/T183291#3869621 (Nuria) We can extend access until April, now , more work needs to happen to make those reports public, we neither seen the data nor the process by which it is harvested.
[21:53:36] Analytics, Analytics-Cluster: Requesting account expiration extension - https://phabricator.wikimedia.org/T183291#3869669 (Nuria) @Jdcc-berkman: you need to open a new ticket to ops requesting access until April. https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Production_access
[21:54:50] Analytics, Analytics-Wikistats: Questionable metrics from Wikistats 2.0 Alpha - https://phabricator.wikimedia.org/T184011#3869685 (Pine)
[21:57:54] Analytics, Analytics-Wikistats: Questionable metrics from Wikistats 2.0 Alpha - https://phabricator.wikimedia.org/T184011#3869724 (Pine)
[22:21:11] Analytics-Cluster, Analytics-Kanban, Patch-For-Review, User-Elukey: Understand Kafka ACLs and figure out what ACLs we want for production topics - https://phabricator.wikimedia.org/T167304#3869804 (Ottomata) Added documentation here: https://wikitech.wikimedia.org/wiki/Kafka/Administration#Kafka_...
[22:44:15] Analytics, Analytics-Cluster: Requesting account expiration extension - https://phabricator.wikimedia.org/T183291#3869876 (Nuria) @Jdcc-berkman Just looked at the tools that you are using to generate the report (@stat1005:/home/jdc) and they are fine and dandy for a prototype. In order to produce recurr...