[00:19:12] 10Analytics, 10Performance-Team, 10Research, 10Security-Team, 10WMF-Legal: A Large-scale Study of Wikipedia Users' Quality of Experience: data release - https://phabricator.wikimedia.org/T217318 (10leila) @JFishback_WMF Gilles can speak to the timelines better. From my perspective, the sooner the better...
[01:42:19] woah, sorry I wasn't paying attention to IRC and missed this saga, good job you two for figuring it out
[02:59:13] 10Analytics: Import 2001 wikipedia data - https://phabricator.wikimedia.org/T155014 (10Graham87) > *While I display some reluctance in that matter insofar as it applies to HomePage (which contains Wikipedia's first ever edit), I'm overall neutral as to whether holding off of creating such gaps is beneficial. Per...
[03:46:42] (03PS1) 10Milimetric: Tell map to render on chart type changes [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/526301 (https://phabricator.wikimedia.org/T226514)
[03:51:46] (03PS1) 10Milimetric: Remove unnecessary child reference [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/526302 (https://phabricator.wikimedia.org/T226514)
[07:01:56] it was pointed out here https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#6_million_articles_already? that at least one graph has too many tick marks and/or poorly scaled axes
[07:18:21] hi renner18! is it about a graph in an article?
[07:18:31] if so my team doesn't handle it
[07:26:30] about this one https://stats.wikimedia.org/v2/#/en.wikipedia.org/content/pages-to-date/normal|line|all|page_type~content|monthly
[07:27:13] ah no that is different, we handle it :)
[07:27:25] can you open a task with the tags "Analytics" and "Wikistats"?
[07:42:16] renner18: --^
[07:44:36] RECOVERY - superset.wikimedia.org HTTPS on analytics-tool1004 is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster
[07:46:58] yess
[07:47:45] good morning!
[07:47:58] hola marcel!
[07:48:11] heyy elukey :]
[07:49:31] mforns: I have stopped puppet on an-coord1001
[07:49:38] the rest will be done by you :)
[07:49:50] elukey, ok, wanna batcave or here?
[07:49:58] I think that we can do it in here
[07:50:01] ok
[07:50:48] so from https://yarn.wikimedia.org/cluster/apps/RUNNING it seems that we are in a good spot now
[07:51:05] in theory, since I don't see much hive-related activity, we could sneak in a restart
[07:51:08] BUT
[07:51:12] let's do things properly :)
[07:51:26] one thing that Joseph always suggests is to stop camus
[07:51:46] aha
[07:52:17] before, we had to remove the entry from the crontab
[07:52:22] now there is a magic command
[07:52:33] emmm, the backfilling for unique_devices that we launched yesterday is about to finish, should we wait a bit?
[07:52:34] sudo systemctl stop camus*.timer
[07:52:39] of course
[07:52:45] in the meantime, let's stop camus
[07:52:50] ok
[07:52:54] two things
[07:53:05] 1) you can see with systemctl list-timers | grep camus the current status
[07:53:24] 2) you can temporarily stop camus with the above command (note the .timer at the end, it is important)
[07:53:28] ok, there's several
[07:53:48] with camus* you get them all
[07:53:49] that's why the star
[07:53:51] ok
[07:53:53] exactly
[07:54:11] the .timer is important since, if you recall, it is the trigger for the .service unit, which is the one containing the command
[07:54:15] ok, done
[07:54:25] so stopping the timer basically stops the recurrent action
[07:54:32] I see
[07:54:36] (like removing the entry from the crontab)
[07:54:43] the caveat is that puppet needs to be disabled
[07:54:50] aha
[07:54:52] since it tries to ensure the timer
[07:55:09] so if you don't disable it, it will re-add it during the next puppet run
[07:55:13] ok, unique_devices is done too
[07:55:28] understood
[07:56:25] you could start now restarting oozie, since it should not cause any big damage
[07:56:34] ok!
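The camus-pausing steps discussed above can be sketched as a short shell sequence. This is only a sketch: the puppet disable message is illustrative, and the unit names are the ones mentioned in the conversation.

```shell
# Run on an-coord1001. Disable puppet first, otherwise its next run
# will re-enable the camus timers it manages.
sudo puppet agent --disable 'pausing camus before hive/oozie restarts'

# Inspect the camus timers and their schedule/status.
systemctl list-timers | grep camus

# Stop every camus timer. The .timer suffix matters: the timer unit is
# the recurring trigger for the matching .service unit, so stopping it
# is the systemd equivalent of removing a crontab entry.
sudo systemctl stop 'camus*.timer'
```

Undoing this is the reverse: re-enable puppet and let its next run restore the timers, which is what happens later in the conversation.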
[07:56:38] it stores its state in a database (the meta instance)
[07:58:00] and you can restart oozie with
[07:58:05] sudo systemctl restart oozie
[07:58:18] right
[07:58:39] done
[07:59:02] let's log it
[07:59:24] !log restarted oozie on an-coord1001.eqiad.wmnet
[07:59:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:59:57] status after restart seems ok
[08:00:09] super
[08:00:20] now the hive-server2 and the hive-metastore
[08:00:25] aha
[08:00:28] https://yarn.wikimedia.org/cluster/apps/RUNNING seems to tell us that we are good to go
[08:00:44] we don't know what those spark shells are doing
[08:00:54] so I wouldn't kill them
[08:01:03] the worst that can happen is that they fail
[08:01:12] same restart command, right? I'm seeing that hive-server2 is also a systemctl service
[08:01:23] ok
[08:01:23] exactly, let's do the metastore first
[08:01:27] oh ok
[08:01:30] hive-metastore
[08:01:51] done
[08:02:50] super
[08:02:54] does it look good?
[08:03:09] yes, afaics
[08:03:10] we also have https://grafana.wikimedia.org/d/000000379/hive
[08:03:15] status is good
[08:03:28] ok good
[08:04:23] oh, I can see a missing data point, but then it resumes normally
[08:04:30] mforns: we were not quick enough, if you see we have refine_sanitize_eventlogging_analytics_immediate running
[08:04:35] let's wait for the server
[08:04:44] yeah the missing datapoint is normal
[08:05:14] elukey, I haven't restarted the server yet
[08:05:16] should I
[08:05:17] ?
[08:07:17] let's wait for the sanitize stuff to complete
[08:07:21] it uses hive right?
[08:08:04] ok finished
[08:08:07] you can go mforns
[08:08:15] ok
[08:08:53] done, status ok
[08:09:54] you can quickly test hive/beeline to confirm
[08:10:02] and then we are done :)
[08:11:10] restart camus no?
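The restart order walked through above (oozie first, then the Hive metastore, then hive-server2, verifying each step and logging it) amounts to roughly the following sketch, run on an-coord1001:

```shell
# Before each step, https://yarn.wikimedia.org/cluster/apps/RUNNING is
# checked for hive-related jobs a restart could break.
sudo systemctl restart oozie            # safe first: state lives in the meta DB
sudo systemctl status oozie             # verify before moving on

sudo systemctl restart hive-metastore   # metastore before the server
sudo systemctl status hive-metastore

sudo systemctl restart hive-server2
sudo systemctl status hive-server2

# Each restart is then announced in IRC with "!log restarted <unit>",
# which the bot records in the Server Admin Log.
```

The ordering reflects the dependencies: oozie and hive-server2 both talk to the metastore, so the metastore comes back first.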
[08:12:30] sure I'll do it via puppet
[08:12:40] ah of course
[08:12:55] also log that you restarted the metastore and hive server
[08:14:07] !log restarted hive-metastore
[08:14:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:14:13] !log restarted hive-server2
[08:14:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:15:15] elukey, beeline is giving me some errors...
[08:15:49] what errors?
[08:16:57] I tried: select * from event.navigationtiming where year=2019 limit 10;
[08:17:05] and it says year is not a field
[08:17:28] this one works though: select * from wmf.edit_hourly where snapshot='2019-05' limit 10;
[08:18:43] does it work in hive?
[08:19:15] yes
[08:19:40] mforns: the event.navigationtiming one works on both hive and beeline for me
[08:19:55] ????
[08:20:03] the select * from event.navigationtiming where year=2019 limit 10
[08:20:16] elukey, from what machine are you logging in?
[08:20:20] stat1004
[08:20:28] I'm in stat1007
[08:20:54] beeline works on it
[08:21:02] just checked
[08:21:04] lemme sudo as you
[08:21:20] elukey, oh wow, I logged out of beeline, logged in again and it works now..
[08:21:37] ah ok :)
[08:21:48] might have been a temporary glitch then
[08:21:54] ok
[08:22:00] all good!
[08:22:07] ok!
[08:24:36] gc time is high on hive no?
[08:25:31] going down now
[08:26:54] so a little bit of activity, especially after a restart, is normal
[08:27:22] when you check graphs of GC time/runs you can start to worry when you see
[08:27:22] ok
[08:27:37] 1) a lot of runs 2) a lot of time spent every time (say 200ms etc..)
[08:27:50] aha
[08:28:19] mforns: https://phabricator.wikimedia.org/T228620 is a good read if you have time
[08:28:56] ok I see the graphs
[08:29:12] so, we're done?
[08:29:17] yep!
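The post-restart sanity check above — running the same partition-pruned query through both clients — can be done non-interactively from a stat host. The `-e` flags are standard hive/beeline options; that the local beeline wrapper already points at the right HiveServer2 is an assumption here.

```shell
# Same query via the Hive CLI (talks to the metastore directly) ...
hive -e "select * from event.navigationtiming where year=2019 limit 10;"

# ... and via beeline (goes through hive-server2), to confirm both
# restarted daemons still resolve partitions like year=2019 correctly.
beeline -e "select * from event.navigationtiming where year=2019 limit 10;"
```

If beeline misbehaves while hive is fine (as briefly happened above), a stale client session is a plausible culprit before suspecting the server.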
[08:29:35] nicely done
[08:29:51] after the first time it gets even more boring
[08:29:56] welcome to opsland
[08:29:56] :D
[08:30:09] hehehe :D
[08:30:54] do you have time to look at and merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/519688/ and https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/519690/ ?
[08:31:02] I was about to ask you something
[08:31:10] yes?
[08:31:13] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/519688/2/modules/profile/manifests/analytics/refinery/job/data_purge.pp - can we ensure => absent instead?
[08:31:20] so puppet will do the clean up
[08:31:52] oh, yea!
[08:31:54] will do
[08:32:06] thanks :)
[08:32:09] the other one is ready
[08:32:18] ready to merge?
[08:32:52] also, just to verify, is skip-trash enabled or disabled?
[08:32:53] elukey, yes :] I removed the skip-trash arg
[08:32:57] ack :)
[08:32:59] merging!
[08:35:13] thaanks!
[08:35:20] will check this is working in the next hours
[08:36:43] done!
[08:36:55] puppet already ran on an-coord1001
[08:37:24] merging also the second one
[08:39:22] ok :]
[08:41:06] thanks for everything, elukey, I will return later in the day :]
[08:42:02] removed!
[08:42:07] mforns: ahahah don't even say that!
[08:42:21] you helped me a lot already
[08:42:28] oh elukey you want me to create another patch to remove the timer code?
[08:42:29] sometimes I am happy if I return something :)
[08:42:40] mforns: if you have time later on, not really urgent
[08:42:49] I know the two-patch step is annoying
[08:42:53] no no
[08:42:58] but it guarantees that we clean up properly :(
[08:47:04] elukey, https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/526386/
[08:48:53] thaaanks, see ya later!
[08:49:04] ack!
[10:06:03] mforns: thank you for doing the work with uniques yesterday. But it seems backfilling didn't load per family values?
I just looked in cassandra and values per family still only go up to nov 2018
[10:07:46] I think this has to be a different issue from the duplicates one, because otherwise the initial backfilling wouldn't have loaded values up to nov 2018, right?
[11:40:49] * elukey lunch!
[12:21:45] hey fdans I think it's because we only loaded daily counts, I still have to backfill monthly
[12:21:52] daily values are there
[13:04:55] fdans: what was your ssh connection killer?
[13:20:45] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Editing - https://phabricator.wikimedia.org/T226855 (10Neil_P._Quinn_WMF) p:05Triage→03Normal
[13:20:52] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Language - https://phabricator.wikimedia.org/T226856 (10Neil_P._Quinn_WMF) p:05Triage→03Normal
[13:32:01] 10Analytics, 10Editing-team: Deletion of limn-edit-data repository - https://phabricator.wikimedia.org/T228982 (10Neil_P._Quinn_WMF) Thanks for the ping @Jdforrester-WMF! @mforns, I have a feeling that we can get rid of everything you mention. However, where should I look to see them? I don't see them in the...
[13:33:59] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic, and 2 others: TLS certificates for Analytics origin servers - https://phabricator.wikimedia.org/T227860 (10elukey)
[13:35:44] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic, and 2 others: TLS certificates for Analytics origin servers - https://phabricator.wikimedia.org/T227860 (10elukey) 05Stalled→03Open
[13:47:52] hey a-team, are we doing retro today?
[13:47:55] (03PS1) 10Mforns: Remove duplicates from unique_devices_per_project_family_monthly [analytics/refinery] - 10https://gerrit.wikimedia.org/r/526425 (https://phabricator.wikimedia.org/T229254)
[13:48:10] (if not I will plan my afternoon differently)
[13:50:36] not sure, in theory we could do it since only Joseph is out
[13:51:19] ottomata, theoretically we're only missing Joseph, right? So yes. But I don't mind skipping today.
[13:51:34] ah you beat me
[13:51:35] we haven't done it for a while ya?
[13:54:20] PROBLEM - Check the last execution of refine_eventlogging_eventbus on an-coord1001 is CRITICAL: NRPE: Command check_check_refine_eventlogging_eventbus_status not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:54:54] (03PS2) 10Mforns: Remove duplicates from unique_devices_per_project_family_monthly [analytics/refinery] - 10https://gerrit.wikimedia.org/r/526425 (https://phabricator.wikimedia.org/T229254)
[13:55:20] ^^ interesting, I just removed that job
[13:55:29] i guess puppet needs to run on the icinga host
[14:00:49] was it removed?
[14:00:54] if so yes :)
[14:01:16] yup it was!
[14:01:24] and the puppet run on icinga removed it from the checks
[14:02:10] ottomata: I have some buster questions when you have 10 mins
[14:02:54] ya now is good luca
[14:03:40] ok! So I applied the role to stat1005 and imported the cdh packages for buster
[14:04:08] but I have a couple left to do
[14:04:28] spark2 is the first, I am wondering if it needs to be built with a special procedure or if a copy from stretch would be fine
[14:05:00] second is snakebite - I tried to build it for buster from the master branch (that seemed more recent than debian/wmf but I could be wrong)
[14:05:06] do we need it?
[14:05:15] IIRC we were thinking about not using it anymore
[14:05:18] i think we should stop using snakebite (do we actually use it?)
[14:05:20] yeah, it is not python3
[14:05:21] so
[14:05:29] heh, luca, maybe we shouldn't even install python2 :o
[14:05:33] on buster nodes!
[14:05:41] that would be awesome yes
[14:05:43] for spark2: there shouldn't be anything special
[14:05:50] its all just packaged .jars
[14:05:54] ah then I can copy
[14:06:05] i think so
[14:06:26] all right
[14:06:30] hmm is that true?
[14:06:37] we do want to upgrade spark and were waiting for buster...
[14:06:39] lemme find the ticket
[14:09:38] elukey: what version of python3 is installed with buster?
[14:09:57] (03CR) 10Mforns: [V: 03+2] Remove duplicates from unique_devices_per_project_family [analytics/refinery] - 10https://gerrit.wikimedia.org/r/526252 (https://phabricator.wikimedia.org/T229254) (owner: 10Mforns)
[14:10:25] (03CR) 10Mforns: [V: 03+2] Remove duplicates from unique_devices_per_project_family_monthly [analytics/refinery] - 10https://gerrit.wikimedia.org/r/526425 (https://phabricator.wikimedia.org/T229254) (owner: 10Mforns)
[14:10:35] i think spark2 will need a rebuild
[14:11:02] 3.7 IIRC
[14:11:05] we install a pyarrow .whl file which is downloaded / created using pip
[14:11:15] and pip will create it for whatever python version it is running
[14:11:18] i'm not sure if that will be a problem
[14:11:23] but the wheel is for cp35
[14:11:25] python 3.5
[14:12:46] yeah let's rebuild to be sure
[14:19:34] (03CR) 10Ottomata: "Tested. This is ready for review." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/525435 (https://phabricator.wikimedia.org/T227896) (owner: 10Ottomata)
[14:46:53] Trash
[14:47:02] oops, wrong place
[15:16:15] elukey, is it possible that the deletion script that we deployed this morning already ran? it is scheduled in puppet at 4:25am...
[15:16:37] but by the contents of the Trash folder, it seems it executed around the time we merged it
[15:17:06] mforns: have you checked with systemctl list-timers?
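The list-timers check suggested above answers exactly this kind of question — whether a timer already fired. The grep pattern and unit name below are assumptions for illustration:

```shell
# LAST/PASSED columns show when a timer last fired; NEXT/LEFT show the
# upcoming trigger (e.g. the 4:25am schedule coming from puppet).
systemctl list-timers --all | grep drop

# The same information for one unit (hypothetical unit name):
systemctl status refinery-drop-old-data.timer
```

Note that puppet merging a new `systemd::timer` resource can also start the unit immediately, which would explain a run right after the merge rather than at the scheduled time.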
[15:17:15] it should tell you the last exec time
[15:17:18] nope :[ doing
[15:17:22] ok
[15:18:36] ottomata: attempting to roll restart jumbo again ok?
[15:18:42] (with the new version of the cookbook)
[15:18:43] yuppers
[15:19:26] elukey, yes it did execute
[15:19:41] mforns: was it bad?
[15:20:30] elukey, no no, all expected, worked fine, but it should have executed at 4:25am instead
[15:22:06] ah good
[15:55:16] weird, it seems that kafka preferred-replica-election doesn't return zero
[15:55:38] hm that is weird
[15:56:01] I tried now and it worked
[15:56:08] I mean, in a shell on kafka-jumbo
[15:56:16] but it failed in the cookbook
[15:56:23] so I suspect I misspelled the command
[15:57:21] mmm it seems correct
[15:57:27] then there must be something in how cumin runs it
[16:00:34] ping ottomata mforns
[16:02:09] ah ottomata when cumin executes it misses the env variables and doesn't set --zookeeper etc..
[16:10:13] ahhh! makes sense!
[16:10:20] you can source them ya?
[16:13:05] will try yes!
[16:23:52] (03PS1) 10Ladsgroup: Use the internal WDQS endpoint instead [analytics/wmde/toolkit-analyzer] - 10https://gerrit.wikimedia.org/r/526471 (https://phabricator.wikimedia.org/T214894)
[16:24:12] (03CR) 10jerkins-bot: [V: 04-1] Use the internal WDQS endpoint instead [analytics/wmde/toolkit-analyzer] - 10https://gerrit.wikimedia.org/r/526471 (https://phabricator.wikimedia.org/T214894) (owner: 10Ladsgroup)
[17:02:30] ottomata: I am restarting jumbo again, the cookbook now works :)
[17:06:57] (03CR) 10Nuria: [C: 03+2] Remove duplicates from unique_devices_per_project_family_monthly [analytics/refinery] - 10https://gerrit.wikimedia.org/r/526425 (https://phabricator.wikimedia.org/T229254) (owner: 10Mforns)
[17:17:29] yeehaw elukey !
[17:18:07] ottomata: \o/ tomorrow I'll create another one for mirror maker and possibly zookeeper
[17:24:49] going afk for a bit, will check the jumbo cookbook run later!
[17:27:07] elukey is loving the cookbooks!
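The cookbook failure elukey diagnoses above comes down to cumin spawning a non-login shell: the profile script that exports the Zookeeper address never runs, so the `kafka` wrapper cannot fill in `--zookeeper`. A sketch of the fix — the profile path is an assumption for illustration:

```shell
# In an interactive shell the kafka wrapper works because the login
# profile exported the cluster's Zookeeper URL. Under cumin that script
# is skipped, so source it explicitly before calling the wrapper.
source /etc/profile.d/kafka.sh   # assumed location of the env exports
kafka preferred-replica-election
echo $?                          # should now be zero on success
```

This is why the command succeeded "in a shell on kafka-jumbo" but failed inside the cookbook: same command, different environment.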
[18:28:57] elukey, if you have time at some point today, can you merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/519691/ ? It's the next deletion script patch, it should trigger at 2am tomorrow, thanks! don't worry if you can't
[18:29:19] (I removed the skip-trash arg)
[18:29:58] mforns: I can! But it will probably trigger the command, is that a problem?
[18:30:17] elukey, no, for today it won't delete anything more
[18:30:26] you can merge no prob
[18:30:43] ack, doing it!
[18:30:46] :]
[18:32:39] (03CR) 10Nuria: Remove unnecessary child reference (032 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/526302 (https://phabricator.wikimedia.org/T226514) (owner: 10Milimetric)
[18:35:12] mforns: merged + puppet run on an-coord1001
[18:35:21] elukey, thanks!
[18:40:39] * elukey afk again :)
[18:41:14] roll restart of kafka jumbo via cookbook completed! metrics look good
[18:43:43] (03CR) 10Mforns: [C: 03+1] "LGTM, but waiting for points brought up by Nuria to be discussed."
(032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/525599 (https://phabricator.wikimedia.org/T226850) (owner: 10MNeisler)
[18:57:52] (03PS2) 10Milimetric: Remove unnecessary child reference [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/526302 (https://phabricator.wikimedia.org/T226514)
[18:57:59] (03CR) 10Milimetric: Remove unnecessary child reference (032 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/526302 (https://phabricator.wikimedia.org/T226514) (owner: 10Milimetric)
[19:08:25] (03CR) 10MNeisler: Hash temporary identifiers in web team schemas (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/525599 (https://phabricator.wikimedia.org/T226850) (owner: 10MNeisler)
[19:08:46] (03CR) 10Nuria: [C: 03+2] Remove unnecessary child reference [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/526302 (https://phabricator.wikimedia.org/T226514) (owner: 10Milimetric)
[19:09:17] nuria: that won't merge until the parent change is reviewed, the one that fixes the map, just fyi
[19:10:01] milimetric: yaya, i am working my way
[19:10:06] sorry :)
[19:10:11] milimetric: jaja
[19:10:27] milimetric: the change works, i tested that, just need to remember vue to be able to truly CR
[19:10:57] np, take your time
[19:12:26] (03CR) 10Nuria: [C: 04-1] Hash temporary identifiers in web team schemas (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/525599 (https://phabricator.wikimedia.org/T226850) (owner: 10MNeisler)
[19:14:39] (03CR) 10Nuria: [C: 03+1] "Sorry, meant to +1" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/525599 (https://phabricator.wikimedia.org/T226850) (owner: 10MNeisler)
[20:11:06] milimetric: I am going to try to move bundles around in the map
[20:11:29] milimetric: cause the map starts displaying at 1.17 secs
[20:11:36] milimetric: on my desktop
[20:11:41] nuria: sure, go for it
[20:11:46] It's loading a lot
[20:45:10] hey, heads-up - we're going to start
sending events to a new schema with the next SWAT window. We will send events to the new MobileWebUIActionsTracking schema. The expected event count is the same as for MobileWebMainMenuClickTracking
[20:45:30] context - we have MobileWebMainMenuClickTracking, but because we now have more menus (not only main), plus we would like to track some additional interactions with the mobile web ui, we introduced the new schema `MobileWebUIActionsTracking`
[20:45:39] bug: https://phabricator.wikimedia.org/T220016
[20:46:00] Traffic to existing schema https://grafana.wikimedia.org/d/000000566/overview?panelId=5&fullscreen&orgId=1 (similar traffic expected for the new one)
[20:46:10] that at this moment tracks the same amount of stuff as MobileWebMainMenuClickTracking. That's temporary; once we prove that the new schema tracks everything we need, we will decommission MobileWebMainMenuClickTracking
[20:46:40] thx jdlrobson for the link to phab and grafana :)
[20:57:24] ^ nuria and ottomata when you have a moment. BBIAB
[20:57:27] raynor: ok
[20:57:34] jdlrobson: ok
[21:32:21] ok looks ok i think, right? not a huge increase in traffic
[21:32:26] gotta run bbbbye!
[22:35:04] nuria: who can I ask on your team for 30 min of their time to walk us through how to do resume screening for the software engineering position? I would love to be able to set up a time before Friday with the person(s) and those on our end who will do resume screening.
[23:50:09] (03CR) 10EBernhardson: swift-upload.py to handle upload and event emitting (035 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/525435 (https://phabricator.wikimedia.org/T227896) (owner: 10Ottomata)
[23:56:02] leila: probably marcel?
[23:56:21] nuria: ok. We will ask for his time then. thanks!