[06:47:54] Good morning [07:53:13] morning! [07:53:20] late start for me this morning :) [07:53:32] :) [07:54:56] I'm still in fight elukey - Obviously when all was good yesterday, I found the hidden plague :S [07:55:00] so joal another person from imply.io answered in druid-users, but he asked if we disable the datasource first etc.. linking the tutorial that you followed :) [07:55:11] :) [07:55:19] joal: with mediawiki-history? [07:55:23] yessir [07:57:52] ah snap [07:58:14] elukey: I'm happy to have found it before deploy, but still, not cool :S [08:00:56] elukey: I'll ask for special authorization from my beloved ops to deploy on a friday [08:01:20] sure makes sense! [08:04:40] (03PS2) 10Joal: Fix mediawiki-history-page create event [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519521 (https://phabricator.wikimedia.org/T221825) [08:22:25] just created https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hue/Administration#Restart_Hue [09:14:56] https://github.com/EDS-APHP/py-hdfs-mount [09:15:18] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Allow all Analytics tools to work with Kerberos auth - https://phabricator.wikimedia.org/T226698 (10elukey) [10:15:49] I didn't know french-hospitals were into big data ;) [10:21:09] these french people, always obsessed with data! [10:21:10] :P [10:21:17] hihi [10:21:50] just moved two hdfs crons on an-master1002 to timers [10:22:03] \o/ [10:22:04] (fetch fsimage + prune old images) [10:22:11] MOAR timers [10:24:56] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Allow all Analytics tools to work with Kerberos auth - https://phabricator.wikimedia.org/T226698 (10elukey) [10:42:19] PROBLEM - Check if active EventStreams endpoint is delivering messages. on icinga1001 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [10:42:43] this is no bueno [10:43:20] (03PS3) 10Joal: Fix mediawiki-history-page create event [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519521 (https://phabricator.wikimedia.org/T221825) [10:46:01] https://stream.wikimedia.org/v2/stream/recentchange returns 502 from nginx [10:46:44] aouch :S [10:46:56] elukey: Can I do anything to help? [10:47:13] trying to check where the problem lies, the backend looks good [10:47:56] elukey: have you had alarms? I don't see any [10:48:27] 12:42 PROBLEM - Check if active EventStreams endpoint is delivering messages. on icinga1001 is CRITICAL: CRITICAL: [10:48:30] No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 [10:48:33] seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [10:48:36] * joal looks just a bit ahead and say sorry [10:48:36] joal: --^ [10:48:40] ah okok :) [11:43:11] RECOVERY - Check if active EventStreams endpoint is delivering messages. on icinga1001 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [11:45:36] super weird [11:49:58] 10Analytics, 10Operations, 10Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10elukey) [11:50:05] joal: --^ [11:50:08] super weird [11:50:38] so this morning I noticed that eventstreams in codfw didn't work [11:50:49] (it is active/active, so we serve from both eqiad and codfw) [11:51:05] so I rolled restart eventstreams on scb2* nodes [11:51:07] (codfw ones) [11:51:17] then a slow creep up for the eqiad nodes happened [11:51:21] not sure why [11:51:26] up to my restart [11:51:33] and now things are stable [11:51:48] this might be due to varnish throttling, but I don't have more ideas/proofs [11:51:51] :S [11:53:37] going to lunch! [12:23:11] hey milimetric - would you by any chance be nearby? [12:44:19] (03PS4) 10Joal: Fix mediawiki-history-page create event [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519521 (https://phabricator.wikimedia.org/T221825) [12:45:47] 10Analytics, 10Operations, 10Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Samwalton9) Looks good to me, thanks! [13:04:27] joal: hey [13:04:31] all yours [13:04:32] Hi milimetric :) [13:04:38] batcave? [13:04:41] yes omw [13:08:15] milimetric: will you review my jsonschema-tools code? you can also try it out [13:08:36] easiest to review here https://github.com/wikimedia/jsonschema-tools/pull/1 [13:13:13] ottomata: o/ [13:13:18] o/ [13:13:43] did you see my ping in _security? [13:14:04] we are discussing https://phabricator.wikimedia.org/T226808 [13:14:12] if you have a minute can you join? [13:31:26] 10Analytics, 10EventBus, 10Operations, 10Core Platform Team Backlog (Watching / External), and 2 others: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 (10Ottomata) Hm @herron, today we experienced {T226808}, which I think is... [13:38:55] My Hive query fails because of running out of memory on notebook1004. Can anyone recommend anything I can do so it doesn't or should I change the query from looking at a month to looking at a day? https://gist.github.com/bearloga/e37a0d985be2c0c6ff52be84cf96a38c [13:39:59] bearloga which kernel are you using? [13:55:30] ottomata: kernel? [13:57:26] using hive, not beeline or spark (if that's what you're asking?) [14:03:45] ottomata: oh duh jupyter kernel. I'm using the R one [14:04:35] but our R package queries hive through the `hive` command [14:06:14] sorry ottomata going to focus on the mediawiki history stuff a bit [14:09:09] milimetric: no hurry on that at all! 
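The Icinga check that fired and recovered above boils down to: connect to https://stream.wikimedia.org/v2/stream/recentchange and see whether at least one Server-Sent Event arrives within 10 seconds. A minimal sketch of that kind of probe, using only the requests library; this is not the actual Icinga plugin, and the event handling is illustrative:

    import json
    import requests

    STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"
    TIMEOUT = 10  # seconds with no data before we call the stream broken

    def stream_delivers_message(url=STREAM_URL, timeout=TIMEOUT):
        """Return True if at least one SSE data payload arrives within `timeout`."""
        try:
            with requests.get(url, headers={"Accept": "text/event-stream"},
                              stream=True, timeout=timeout) as resp:
                resp.raise_for_status()
                for line in resp.iter_lines(decode_unicode=True):
                    if line and line.startswith("data:"):
                        json.loads(line[len("data:"):].strip())  # payloads are JSON events
                        return True
        except (requests.RequestException, ValueError):
            return False
        return False

    if __name__ == "__main__":
        print("OK" if stream_delivers_message() else "CRITICAL")

A 502 from nginx, as seen during the outage, would surface here as raise_for_status() failing, i.e. CRITICAL.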
[14:10:26] bearloga: try [14:10:26] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Queries#Hadoop_containers_run_out_of_memory [14:10:41] althhough i think more thhan 4GB might be needed for yours [14:10:57] i think the container limits have been increased to 4G by default anyuway [14:12:12] I am wondering if we should start advertising the fact that when kerberos will be enabled, beeline/hive-server-2 will need to be used (since they will be the only ones allowing auth) [14:15:19] 10Analytics, 10User-Elukey: Show IPs matching a list of IP subnets in Webrequest data - https://phabricator.wikimedia.org/T220639 (10faidon) 05Resolved→03Open So, a few things: - There is a better source for this kind of data, that is updated hourly rather than monthly: https://as286.net/data/ana-invalids.... [14:16:54] PROBLEM - Check if active EventStreams endpoint is delivering messages. on icinga1001 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:19:48] elukey: +1 [14:22:31] ottomata: thank you! that fixed it [14:31:49] 10Analytics, 10Analytics-Kanban: Fix Hive partition thresholding in refinery-drop-older-than - https://phabricator.wikimedia.org/T226835 (10mforns) [14:36:02] (03PS1) 10Mforns: Fix daily and monthly partition thresholding [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519643 (https://phabricator.wikimedia.org/T226835) [14:38:43] (03CR) 10Mforns: [V: 03+2] Fix daily and monthly partition thresholding [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519643 (https://phabricator.wikimedia.org/T226835) (owner: 10Mforns) [14:57:10] (03CR) 10Ottomata: [C: 03+1] "Nice thank you!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519643 (https://phabricator.wikimedia.org/T226835) (owner: 10Mforns) [15:10:42] (03CR) 10Milimetric: "Since these comments are all about comments, I'll make the changes and send the patch." (037 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519521 (https://phabricator.wikimedia.org/T221825) (owner: 10Joal) [15:12:55] I am checking my spark job via https://yarn.wikimedia.org/proxy/application_1561367702623_13494/ [15:13:06] but it seems stuck.. is it waiting for something? [15:13:16] resources seem to be available in the default queue [15:13:39] I changed the code of the rpki stuff, it should be working but i don't see any error in the logs [15:14:41] in fairness I requested 656384 MB of ram, that might be too much [15:16:51] 10Analytics, 10Discovery, 10Operations, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Nuria) The expiration for objects can be specified at the time of upload, so it needs to be added to our current wo... [15:20:46] elukey: looks like it was killed? [15:20:52] ohh by user. [15:20:55] (03PS5) 10Milimetric: Fix mediawiki-history-page create event [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519521 (https://phabricator.wikimedia.org/T221825) (owner: 10Joal) [15:20:57] 10Analytics, 10Discovery, 10Operations, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10fgiunchedi) @EBernhardson @Ottomata re: swift expiring objects, see the link above too and tl;dr is: The X-Delete-... 
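The out-of-memory fix ottomata pointed bearloga at, a little further up, amounts to asking YARN for bigger map/reduce containers for that one query instead of shrinking it from a month to a day. Since the R helper in question shells out to the hive CLI, here is a sketch of passing the same knobs from Python; the property names are the standard MapReduce ones the linked wikitech section deals with, but the values and the example query are placeholders, not a recommendation:

    import subprocess

    # Request larger YARN containers for this query only; heap (-Xmx) is kept a
    # bit below the container size. Values are illustrative.
    HQL = """
    SET mapreduce.map.memory.mb=4096;
    SET mapreduce.map.java.opts=-Xmx3686m;
    SET mapreduce.reduce.memory.mb=8192;
    SET mapreduce.reduce.java.opts=-Xmx7372m;
    SELECT COUNT(1) FROM wmf.webrequest
    WHERE year=2019 AND month=6 AND day=28 AND hour=0;
    """

    def run_hive(hql: str) -> str:
        """Run an HQL script through the hive CLI and return its stdout."""
        result = subprocess.run(["hive", "-S", "-e", hql],
                                capture_output=True, text=True, check=True)
        return result.stdout

    if __name__ == "__main__":
        print(run_hive(HQL))

As elukey notes above, once Kerberos is enabled the same statements would go through beeline/HiveServer2 rather than the hive CLI.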
[15:22:24] 10Analytics, 10Discovery, 10Operations, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Ottomata) Great! @fgiunchedi you said 'that is something we'd have to deploy first'. Can I use this now? [15:26:32] ottomata: yeah I did it, re-launched with less workers but not much different.. https://yarn.wikimedia.org/cluster/app/application_1561367702623_13562 [15:26:42] so I am pretty sure I am failing to do something [15:27:26] elukey: it looks like it is doing stuff... at least it has active executors [15:28:02] so those are either super slow or doing something not good right? [15:28:10] i guess? [15:28:12] is there a way to see what they are doing? [15:28:22] https://yarn.wikimedia.org/proxy/application_1561367702623_13562/stages/ [15:28:27] 128 running [15:28:28] 0 done. [15:28:41] maybe the showString is your problem? [15:28:45] where is your .py file? [15:28:47] can I look? [15:28:59] maybe you should write to a file rather than stdout? [15:29:05] (just guessing here) [15:29:23] sure, it is in /home/elukey/rpki/rpki_invalid_prefix_finder.py on stat1004 [15:29:51] I am trying to change the UDF that joseph came up with after a chat with Faidon, but probably now the UDF is wrong [15:30:43] (03CR) 10Nuria: Fix daily and monthly partition thresholding (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519643 (https://phabricator.wikimedia.org/T226835) (owner: 10Mforns) [15:30:59] elukey: yeah, i think maybe the .show(10000, False) is theh problem? [15:31:07] what if you do instead [15:31:59] i think [15:32:15] ottomata: it worked before I changed the UDF, not sure if it is the show() [15:32:21] .write.options("delimeter", "\t").csv("/user/elukey/path/to/output") [15:32:21] ? [15:32:25] nuria: mforns milimetric hellooo before we freeze wikistats ui development, I think this is the one patch out of the unmerged 4 that really needs to be merged and deployed [15:32:25] https://gerrit.wikimedia.org/r/#/c/analytics/wikistats2/+/519016/ [15:32:30] that might be old though.. [15:32:53] hmm yeah maybe not? [15:33:00] what did you change? [15:33:16] (sorry meeting) [15:44:16] fdans, lookin! [15:48:43] 10Analytics, 10Discovery, 10Operations, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10fgiunchedi) >>! In T213976#5292498, @Ottomata wrote: > Great! @fgiunchedi you said 'that is something we'd have to... [15:49:23] (03CR) 10Mforns: [V: 03+2] Fix daily and monthly partition thresholding (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519643 (https://phabricator.wikimedia.org/T226835) (owner: 10Mforns) [15:49:29] 10Analytics, 10Discovery, 10Operations, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Ottomata) Oh ok, will do! [15:57:28] hey milimetric - pre-standup? [15:57:47] omw [16:01:05] ping ottomata [16:01:15] ping fdans [16:01:43] 10Analytics, 10Discovery, 10Operations, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10mmodell) >>! In T213976#4968603, @Ottomata wrote: > @mmodell This is kind of a 'deployment' process thing, is this... 
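On the Swift thread quoted above: the expiration nuria and fgiunchedi mention is Swift's own object-expiry feature, driven by an X-Delete-After header (TTL in seconds) or X-Delete-At (absolute unix timestamp) supplied at upload time, and it only takes effect where the object expirer is deployed, which is what the ticket is sorting out. A sketch of such an upload with plain requests; the storage URL, token, container and object names are all hypothetical:

    import requests

    def upload_with_expiry(storage_url, token, container, name, data, ttl_seconds):
        """PUT an object into Swift and ask Swift to delete it after ttl_seconds."""
        resp = requests.put(
            f"{storage_url}/{container}/{name}",
            data=data,
            headers={
                "X-Auth-Token": token,
                # Swift's object expirer removes the object once the TTL elapses;
                # X-Delete-At with a unix timestamp is the absolute-time variant.
                "X-Delete-After": str(ttl_seconds),
            },
        )
        resp.raise_for_status()
        return resp.headers.get("Etag")

    # e.g. keep a computed dataset around for 30 days (names are made up):
    # upload_with_expiry(storage_url, token, "elasticsearch",
    #                    "popularity-scores-2019-06-28.csv",
    #                    open("scores.csv", "rb"), 30 * 24 * 3600)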
[16:01:45] (03PS6) 10Milimetric: Fix mediawiki-history-page create event [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519521 (https://phabricator.wikimedia.org/T221825) (owner: 10Joal) [16:01:50] 10Analytics, 10Discovery, 10Operations, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10mmodell) >>! In T213976#4995886, @Ladsgroup wrote: > Yes, you're right. Maybe turning mwmaint1002 to a minikube and... [16:02:33] OH uh [16:13:09] Nettrom: Good morning -- I apologize I'm late on deploy [16:13:38] Nettrom: I will deploy later on today (even if Frida, I have special permisions from my dear ops- [16:18:15] RECOVERY - Check if active EventStreams endpoint is delivering messages. on icinga1001 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [16:19:25] (03CR) 10Joal: [C: 03+2] Fix daily and monthly partition thresholding [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519643 (https://phabricator.wikimedia.org/T226835) (owner: 10Mforns) [16:19:42] thanks joal :D [16:19:45] ;) [16:26:14] (03PS2) 10Joal: Update mediawiki-history for page-history refactor [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519349 (https://phabricator.wikimedia.org/T221825) [16:29:04] joal: now I'm worried that you're reading my mind, as I was about to ask about the deployment ;) sorry you're having to do it on a Friday, I appreciate you working at this hour [16:30:02] Nettrom: I try not to read minds, I have too many problems with mine alone - I'll keep you posted on deploy happenning :) [16:35:45] milimetric: shall I merge the patch and start deployment? [16:35:50] Arf [16:35:52] sorry [16:35:53] :) [16:37:07] 10Analytics, 10Analytics-Kanban: Move reportupdater queries from limn-* repositories to reportupdater-queries - https://phabricator.wikimedia.org/T222739 (10Nuria) See code change: https://gerrit.wikimedia.org/r/#/c/analytics/reportupdater-queries/+/517084/ (merged on other task) [16:37:11] (03CR) 10Joal: [V: 03+2 C: 03+2] "Let's merge!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519521 (https://phabricator.wikimedia.org/T221825) (owner: 10Joal) [16:37:29] ottomata: anything to merge before I deploy? [16:37:31] 10Analytics, 10Analytics-Kanban: Move reportupdater queries from limn-* repositories to reportupdater-queries - https://phabricator.wikimedia.org/T222739 (10Nuria) Puppet code : https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/517085/ [16:38:28] joal: deploy? not for me [16:38:32] it is a friday tho?! deploY!? [16:38:33] :p [16:38:50] ottomata: I got a special perm from elukey - but I know I'm doing wrong [16:39:51] 10Analytics, 10Operations, 10Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Ottomata) EventStreams is hitting its concurrent connection limits of about 200 connections. We think this is probably due to a single cl... [16:39:58] :) [16:43:55] (03PS1) 10Joal: Remove incorrect comment and fix typo in changelog [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519666 [16:44:16] ottomata: what swift container for test uploads? [16:44:28] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy." 
[analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519666 (owner: 10Joal) [16:44:44] ebernhardson: up to you mostly, what do you think would be best? [16:44:47] 'elasticsearch'? [16:45:07] ottomata: so it doesn't really matter, just a tag of sorts? [16:45:13] ya [16:45:23] gotcha [16:48:49] so the /mnt/hdfs mountpoint works with kerberos! it just fails if you don't have a ticket with input/output errors [16:50:21] nice [16:54:48] elukey: we can hack config on scb1001 to enable trace logging with X-Client-IP, restart eventstreams [16:54:51] and see [16:54:53] shall we? [16:55:12] yar i have a meeting for the next 30 mins tho... [16:57:49] ottomata: not sure if X-Client-IP is sent by varnish to the backend, but if so yes! [16:57:56] it should be! [16:58:25] ok will do this meeting then we try [16:59:13] 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (10Ottomata) [17:02:45] ottomata: or we could use tcpdump :) [17:08:06] (03CR) 10Milimetric: [C: 03+2] Update mediawiki-history for page-history refactor [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519349 (https://phabricator.wikimedia.org/T221825) (owner: 10Joal) [17:08:08] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Update mediawiki-history for page-history refactor [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519349 (https://phabricator.wikimedia.org/T221825) (owner: 10Joal) [17:08:34] (03CR) 10Joal: [V: 03+2] Bump jar version for oozie webrequest load bundle [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519506 (https://phabricator.wikimedia.org/T225792) (owner: 10Joal) [17:11:54] 10Analytics, 10Analytics-Kanban, 10Operations, 10vm-requests, 10User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (10elukey) [17:12:21] !log Refinery-source v0.0.93 released to archiva [17:12:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:12:38] !log Deploying refinery with scap [17:12:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:22:29] 10Analytics, 10Operations, 10Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Pchelolo) [17:28:23] elukey: know of a good way to be sure to catcpure the X-Client-IP Header? [17:28:34] with tcpdump? [17:28:36] 10Analytics, 10Analytics-Kanban, 10Operations, 10vm-requests, 10User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (10elukey) [17:29:36] 10Analytics, 10Analytics-Kanban, 10Operations, 10vm-requests, 10User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (10elukey) [17:30:57] acutally maybe sudo tcpdump -A port 8092 | grep X-Client-IP is enought! 
[17:31:05] ottomata: in theory we could capture everything with -A and then use wireshark to follow the HTTP flow, lemme try [17:31:11] not sure if you'll see all the headers [17:31:35] 10Analytics, 10Analytics-Kanban, 10Operations, 10vm-requests, 10User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (10elukey) [17:31:38] althoughh i'm not sure if i'm seeing that every time [17:31:48] elukey: hacking the config there is pretty easy [17:31:54] we could do it officially via scap if we wanted to [17:32:00] but it is just adding a couple of lines in the config [17:32:02] sure we can do it manually [17:33:40] ok elukey [17:33:40] hacked [17:33:44] will depool, restart, pool [17:33:53] (03CR) 10Mforns: [C: 03+2] Create "all" time ranges based on the metric config [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/519016 (https://phabricator.wikimedia.org/T226486) (owner: 10Fdans) [17:33:54] sudo journalctl -f -u eventstreams [17:33:56] to follow :) [17:34:01] "Unexpected error while saving file: database or disk is full" uhh everything alright with notebook1004? [17:34:13] 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (10Ottomata) [17:34:24] darn, bearloga beat me to reporting that [17:34:28] aaah! clicked the +2 button by mistake again! :[ [17:34:37] bearloga: checking, did any of you by any chance create a lot of files? [17:34:40] sighh probably not! who's using theh space again?!?! [17:34:53] /dev/mapper/notebook1004--vg-data 136G 129G 0 100% /srv [17:35:05] we need quotas [17:35:29] I am checking homes [17:35:50] 47G nathante [17:35:57] yep [17:36:06] he is not online though [17:36:25] elukey, is there a way to interrupt a gate and submit job? [17:36:40] mforns: not sure what you mean [17:36:45] elukey, can you cancel the wikistats one: https://integration.wikimedia.org/zuul/ [17:36:50] ahhhh [17:36:59] I clicked on +2 accidentally, [17:37:17] nothing super critical, but just wanted to avoid the revert... [17:37:24] if you revert your +2, shoudn't it be ok? [17:37:34] the CI should only add V+2 IIRC [17:38:22] elukey, ahaahh, but the description of the job in zuul said it would merge it [17:38:26] it didn't though [17:38:31] super :) [17:39:08] thanks elukey, I think gerrit +2 button should not be blue! [17:39:18] ahahahah [17:39:25] should be gray like all the normal buttons! [17:41:41] Just DMed Nate on Twitter [17:42:52] fixed it bearloga [17:43:01] there was a trash directory taking ~36G [17:43:38] !log deleted /srv/home/nathante/.local/share/Trash/* to free space on notebook1004 [17:43:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:44:19] hi [17:45:57] groceryheist: hello :) [17:46:10] there was a big trash directory in /srv/home/nathante/.local/share/Trash/ [17:46:20] that was causing a lot of disk usage on notebook1003 [17:46:22] err 1004 [17:46:26] now gone, so all good :) [17:48:37] ok restarted and repooled, these logs go to /srv/log/eventstreams/main.log [17:48:51] Is that trash dir only for files that get deleted through the jupyter interface? rm doesn't do that, right? [17:49:19] although I don't see any for /v2/streams... 
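Since the hacked-in trace logging on scb1001 is not showing the /v2/stream requests yet, the tcpdump fallback discussed just above (capture the plaintext backend traffic on port 8092 and grep for X-Client-IP) can be turned into a quick per-client tally, which is essentially the data later posted to T226808. A sketch that wraps the same tcpdump invocation; it assumes the nginx-to-service hop really is unencrypted HTTP on port 8092, as the command in the log implies, and it needs root:

    import collections
    import re
    import subprocess

    CLIENT_IP_RE = re.compile(r"X-Client-IP:\s*(\S+)", re.IGNORECASE)

    def tally_client_ips(port=8092, packets=5000):
        """Capture a burst of packets and count X-Client-IP headers per client."""
        # -A prints payloads as ASCII, -l line-buffers, -n skips DNS lookups,
        # -c stops after `packets` packets so the capture ends on its own.
        proc = subprocess.run(
            ["tcpdump", "-A", "-l", "-n", "-c", str(packets), "port", str(port)],
            capture_output=True, text=True, errors="replace",
        )
        return collections.Counter(CLIENT_IP_RE.findall(proc.stdout)).most_common(10)

    if __name__ == "__main__":
        for ip, hits in tally_client_ips():
            print(f"{hits:6d}  {ip}")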
[17:49:27] yeah I wasn't really aware this was a thing that happens [17:49:37] bearloga: yes I think so, rm brutally drops data :) [17:52:10] ottomata: did you find anything useful on scb1001? [17:54:13] elukey: its still logging but not yet? i do'nt see any incoming stream requets... [17:54:56] yar, hang on gotta go afk for a few mins.... [18:00:48] (03CR) 10Mforns: "Sorry for the mess with +2 and jenkins." [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/519016 (https://phabricator.wikimedia.org/T226486) (owner: 10Fdans) [18:01:23] elukey: please excuwe me, I'd need some help with scap (I assume size related) [18:02:16] joal: sure! [18:02:25] notebooks complaining? [18:02:40] elukey: 2 targets failed, don't know which [18:03:52] yep notebook1003 100% usage [18:03:57] indeed elukey - notebook1003 and notebook1004 [18:04:01] :s [18:04:06] shall I rollback? [18:04:08] joal: I really think that we should force scap to keep only one revision [18:04:11] no no [18:04:14] lemme fix it [18:04:25] I say no to rollback then [18:04:29] ok? [18:04:50] elukey: We could keep more than 1, maybe like 5 - but that would already be a lot less [18:05:32] joal: no no more than 1 is exactly the problem, those revs are huge [18:05:35] we have 2 now [18:05:49] so I cleaned up notebook1003 [18:05:51] Ahhhh - you're talking about jars !!!! [18:06:04] well the whole thing [18:06:04] 1-rev whu not, and 5 jars no more :) [18:06:11] ahhh yes 5 jars [18:06:13] yes yes [18:06:17] ok I got you :) [18:06:21] same for me :) [18:06:36] ok - so saying 'no' to scap about rollback ? [18:06:39] elukey: --^ [18:06:41] yep yep [18:06:51] ok- I'm afraid when doing so :) [18:06:56] you can try to deploy with limit to 1003 [18:07:03] 1004 need to check why it failed [18:07:15] I think it's space related as well [18:07:39] ah interesting [18:07:40] Using deprecated git_fat config, swap to git_binary_manager [18:07:47] not the error but worth to follow up [18:08:35] elukey: going for notebook1003 [18:08:50] went so fast that I ssumed it actually failed (no jar downloaded) [18:08:52] joal: try also 1004, srv partition is 73% full [18:08:54] so a lot of space [18:08:59] ok [18:09:13] Same thing [18:09:24] where/how would we trim the jars that we keep? (so I can try to follow up next week) [18:09:25] super fast, no jar download [18:09:38] maybe you can --force [18:10:01] elukey: It means checking everywhere (oozie, timers, cron) about version and if we can safely bump [18:10:19] ah so manually [18:10:45] elukey: I think we need to rollback the last rev on notebooks, to deploy again (jars corrupted) [18:11:05] joal: iirc if you --force you should create a new rev no? [18:11:25] Ah, I don;t know [18:11:27] testing [18:11:58] seems correct indeed elukey ! many thanks :) [18:12:25] \o/ [18:13:46] ok full deploiy done with scap - deploying on hdfs [18:13:59] groceryheist Nettrom elukey: so I looked everywhere in Jupyter Notebook & Lab interfaces and there's nothing about emptying the trash. [18:14:05] Nettrom: Your whitelist patch has been deployed, you should see changes soon :) [18:14:29] joal: going afk a bit for dinner, will bbl! [18:14:39] bye elukey - Thanks agai [18:14:41] joal: wonderful, thanks for making that happen! [18:15:04] so not sure what we can do going forward except…uh…remember to SSH into notebook100X every now and then and check it ourselves??? 
[18:15:31] !log Deploy refinery to HDSF [18:15:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:15:47] bearloga: setup a cron job to remove big files from the trash? [18:17:23] groceryheist: oooooh I like that idea. hm… ottomata: is that something that can be enabled for all users automatically? (a cron job to remove big trashed files, since there's no way to do it through Jupyter's UI) [18:24:32] Arf- forgot to update jar version in data-quality bundle prop [18:25:43] (03PS1) 10Joal: Bump data-quality oozie bundle jar to v0.0.93 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519678 [18:25:52] mforns: --^ [18:29:10] phew ok back [18:30:55] ok collected a buncha logs [18:31:04] resetting eventstreams logging back to normal [18:31:46] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519678 (owner: 10Joal) [18:32:34] ottomata: I need to dpeloy again (refinery with scap), had forgotten a jar-bump - Do you mind being reachable, I already ran intio issues earlier on becasue of space :( [18:32:43] joal: am here now ya [18:33:27] Thanks [18:33:46] !log Deploy refinery with scap [18:33:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:34:22] elukey: indeed a see a loot from a single IP [18:35:06] lot* [18:35:15] proceeding with the ip throttling idea [18:35:34] 10Analytics, 10Operations, 10Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Ottomata) Collected some info about which IPs were connecting on scb1001. Over a period of about 40 minutes: 3 "100.26.... [18:40:17] (03PS1) 10Zhuyifei1999: queryrun.py: Default extra_info to {} [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/519679 [18:41:15] ottomata: what is the idea? code in es? [18:42:00] bearloga: about the Trash, there must be a way to autoclean it, let's try to find it first :) [18:42:15] (maybe a setting in jupyterhub) [18:44:25] (03CR) 10Zhuyifei1999: [C: 03+2] queryrun.py: Default extra_info to {} [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/519679 (owner: 10Zhuyifei1999) [18:44:35] ottomata: I am logging off now, but please ping me if you need me [18:44:48] (03Merged) 10jenkins-bot: queryrun.py: Default extra_info to {} [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/519679 (owner: 10Zhuyifei1999) [18:44:53] ottomata: i can be rubberduck in elukey 's absence [18:46:09] joal, was in meeting, cool! thanks for deploy! [18:46:18] np, redeploing [18:50:21] ottomata: error on analytics1030.eqiad.wmnet [18:50:23] :( [18:50:28] I assume disk full [18:51:01] ottomata: I now know how to redeploy (limit and force), but I'd need you to remove some revs first please [18:52:14] ! 
Deploying refinery to HDFS [18:52:46] !log Kill webrequest bundle [18:52:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:53:16] !log Kill data-quality-hourly bundle [18:53:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:54:34] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list as needed for iOS - https://phabricator.wikimedia.org/T226849 (10kzimmerman) [18:55:17] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Web - https://phabricator.wikimedia.org/T226850 (10kzimmerman) [18:55:58] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Android - https://phabricator.wikimedia.org/T226852 (10kzimmerman) [18:56:43] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for AHT - https://phabricator.wikimedia.org/T226853 (10kzimmerman) [18:57:06] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Growth - https://phabricator.wikimedia.org/T226854 (10kzimmerman) [18:57:48] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Editing - https://phabricator.wikimedia.org/T226855 (10kzimmerman) [18:58:03] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Language - https://phabricator.wikimedia.org/T226856 (10kzimmerman) [18:59:25] !log Restart Webrequest bundle [18:59:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:01:07] sorry joal just sawa ping [19:01:15] looking [19:01:20] thanks ottomata :) [19:02:06] mforns: !log restart data-quality-hourly bundle [19:02:35] joal, :] [19:02:51] Ok - Taking a break, will monitor jobs in a bit, and also finalize deploy for MWH now that other jobs are done [19:02:51] ok joal try now [19:02:55] sure ottomata [19:08:07] Worked ottomata - many thanks :) [19:11:51] great [19:15:15] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Community Tech - https://phabricator.wikimedia.org/T226861 (10nettrom_WMF) [19:16:21] 10Analytics, 10Community-Tech, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Community Tech - https://phabricator.wikimedia.org/T226861 (10nettrom_WMF) [19:16:52] 10Analytics, 10Analytics-Kanban: Make timers that delete data use the new deletion script - https://phabricator.wikimedia.org/T226862 (10mforns) [19:17:14] 10Analytics, 10Anti-Harassment, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for AHT - https://phabricator.wikimedia.org/T226853 (10nettrom_WMF) [19:18:25] 10Analytics, 10Product-Analytics, 10Growth-Team (Current Sprint): Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Growth - https://phabricator.wikimedia.org/T226854 (10nettrom_WMF) [19:20:44] cc Nettrom deployment happen today [19:20:53] due to some last minute work [19:27:00] nuria: thanks! joal kept me in the loop on that today. 
Checked the Data Lake now and can confirm that data is flowing in and appears to be correct [19:27:43] time for lunch [19:53:43] a-team, finished what I wanted to do, going for vacation, see you in a couple weeks! have fun :] [19:53:51] byyeeee marcel! [19:53:55] mforns: have an awesome time! [19:54:05] o/ mforns [19:54:06] See you Marcel, enjoy your time off ! [19:54:37] thank youuuu! [19:54:48] joal, have a nice time too :D [19:54:54] I will :) [19:55:24] :] [20:04:32] !log drop-recreate mediawiki_history, mediawiki_page_history and mediawiki_user_history tables in hive [20:04:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:09:26] groceryheist: btw, since you work with readingDepth data this new metric might be of interest [20:09:55] groceryheist: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/SessionLength [20:47:09] (03CR) 10Milimetric: Fix mediawiki-history-page create event (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519521 (https://phabricator.wikimedia.org/T221825) (owner: 10Joal) [20:57:51] !log Restart mediawiki-history-denormalize-coord, mediawiki-history-check_denormalize-coord, mediawiki-history-reduced-coord [20:57:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:59:02] 10Analytics, 10Operations, 10Patch-For-Review, 10Security, 10Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Ottomata) [21:01:54] (03PS1) 10Joal: Fix typo in mediawiki_page_history table creation [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519719 [21:03:19] (03CR) 10Joal: [V: 03+2 C: 03+2] Fix mediawiki-history-page create event (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519521 (https://phabricator.wikimedia.org/T221825) (owner: 10Joal) [21:04:12] (03PS2) 10Joal: Fix typo in mediawiki_page_history table creation [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519719 [21:04:30] (03CR) 10Joal: [V: 03+2 C: 03+2] "typo, merging" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519719 (owner: 10Joal) [21:05:29] ok done for tonight - prod is almost ready to welcome the new snapshot, only missing a checker-coherent of previous version ,which is currently being computed [21:22:53] 10Analytics, 10Operations, 10Patch-For-Review, 10Security, 10Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Ottomata) To hold us over on the weekend, I've manually blacklisted the offending IP in Eve... [21:23:46] gotta run! have a good weekend all! [23:03:08] 10Analytics, 10Operations, 10Patch-For-Review, 10Security, 10Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Nuria) {F29666420} Well, blocking that one IP had the effect of lowering connections. Give... [23:05:58] thanks nuria! [23:06:28] that's pretty nice! [23:07:03] groceryheist: we hope to have that data by q2, that is probably oct or so [23:07:20] nuria: I'd like to work on helping create pathways to release this public versions of kind of data [23:07:51] cool [23:08:03] would you like some feedback on the design? [23:18:40] groceryheist: totally, [23:19:02] groceryheist: you can do it wiki style (talk page) or the channel
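The graph nuria references shows why the single blacklisted IP mattered: one client was consuming most of the roughly 200 concurrent connections EventStreams allows in total, and the follow-up discussed on T226808 is capping connections per client rather than only globally. EventStreams itself is a Node.js service, so the following is only a language-agnostic sketch of that idea in Python, keyed on the X-Client-IP header surfaced by the earlier trace logging; the per-client limit of 25 is invented for the example:

    import collections
    import threading

    PER_IP_LIMIT = 25  # hypothetical; the service-wide cap in the incident was ~200

    class PerIpLimiter:
        """Track concurrent streams per client IP and refuse new ones over the limit."""

        def __init__(self, limit=PER_IP_LIMIT):
            self.limit = limit
            self.active = collections.Counter()
            self.lock = threading.Lock()

        def try_acquire(self, client_ip):
            with self.lock:
                if self.active[client_ip] >= self.limit:
                    return False  # caller answers 429 and closes the connection
                self.active[client_ip] += 1
                return True

        def release(self, client_ip):
            with self.lock:
                self.active[client_ip] -= 1
                if self.active[client_ip] <= 0:
                    del self.active[client_ip]

    # In a request handler: ip = headers.get("X-Client-IP")
    # if not limiter.try_acquire(ip): reject; and call limiter.release(ip) on close.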