[00:09:35] Analytics, Analytics-Kanban, Patch-For-Review: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (ayounsi) Thanks for looking into that! About TCP flags, if you filter with `IP proto: tcp` there should not be any null. It's expected that other IP...
[01:36:34] Analytics, Analytics-Kanban, Patch-For-Review: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (Nuria) @ayounsi if you can provide a map like: map: {"32":"URG", "16":"ACK","8":"PSH", "4":"RST", "2":"SYN", "1":"FIN"} we can do th...
[02:17:01] Analytics, Analytics-Kanban, Patch-For-Review: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (ayounsi) Adding the following to the ones you listed should cover most of the cases: > {'49': 'ACK+URG+FIN', '48': 'ACK+URG', '40': 'PSH+URG', '36': 'RST+...
[06:24:01] Analytics-Kanban: Test if Hue can run with Python3 - https://phabricator.wikimedia.org/T233073 (elukey) From upstream, it seems that the fixed version of Hue (4.3) will only be available in CDH6.
[06:27:48] Analytics, Analytics-Kanban, Patch-For-Review: Move the Analytics Refinery to Python 3 - https://phabricator.wikimedia.org/T204735 (elukey)
[06:27:57] Analytics, Analytics-Kanban, Patch-For-Review: Move the Analytics Refinery to Python 3 - https://phabricator.wikimedia.org/T204735 (elukey) a:elukey
[06:28:22] Analytics, Analytics-Kanban, User-Elukey: Create a spicerack recipe to reboot the hadoop worker nodes - https://phabricator.wikimedia.org/T225297 (elukey)
[06:34:26] morningggg
[06:34:36] the script to roll reboot hadoop seems to be working fine
[06:35:06] it reboots in batches (non-hdfs-journal-nodes first) and if one reboot fails, it asks for confirmation before proceeding
[06:35:15] then it reboots journal node hosts one at a time
[06:35:23] (this is only for workers)
[06:35:48] for every batch, it
[06:35:54] 1) stops yarn and waits 10 minutes
[06:35:58] 2) stops the hdfs datanode
[06:36:03] 3) if needed, stops the journalnode
[06:36:06] 4) reboots
[06:36:23] for every *host* in the batch, it
[06:36:46] I think it is finally ready now
[07:03:42] https://www.datanami.com/2019/01/10/cloudera-unveils-cdp-talks-up-enterprise-data-cloud/
[07:03:48] sigh
[07:03:55] that complicates our plans for CDH6
[07:04:16] or better, it might mean upgrading to 6 and then to something else soon
[07:30:28] (CR) Fdans: [V: +2 C: +2] Add fake test data for mediarequests per file [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/536556 (owner: Fdans)
[07:30:50] look fdans that starts working without even saying hello
[07:31:46] OH SORRY LUCA HOW ARE YOU HOPE YOU'RE HAVING A MOST MAGICAL DAY
[07:32:01] hahahahahaah
[07:32:24] a "good morning" would have been enough Francisco but I appreciate the effort :P
[07:33:23] elukey: I saw your question on the hue repo yesterday
[07:33:45] you could send a PR titled "START OVER" that just deletes the entire project
[07:36:02] ahahahhah
[07:36:11] yes a really polite enquiry
[07:56:48] Analytics,
10Operations, 10User-Elukey: setup/install eqiad kerbos node WMF5173 - https://phabricator.wikimedia.org/T233141 (10elukey) a:05elukey→03RobH Thanks a lot! Hostnames: krb1001 krb2001 (already updated the naming conventions in wikitech) Internal subnet, no Analytics VLAN raid1 is good enough [07:57:19] 10Analytics, 10Operations, 10User-Elukey: setup/install codfw kerbos node WMF6577 - https://phabricator.wikimedia.org/T233142 (10elukey) a:05elukey→03RobH Thanks a lot! Hostnames: krb1001 krb2001 (already updated the naming conventions in wikitech) Internal subnet, no Analytics VLAN raid1 is good enough [08:27:10] just sent an email to the team about the recent hadoop discoveries [08:27:19] I see a lot of joy for us in the future [08:27:20] sigh [08:47:05] (03PS3) 10Fdans: Add aggregate mediarequests per referer endpoint [analytics/aqs] - 10https://gerrit.wikimedia.org/r/537114 (https://phabricator.wikimedia.org/T232857) [08:47:40] (03CR) 10Fdans: [V: 03+1] "@joal this is ready to merge before deploying aqs if ok with you" (031 comment) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/537114 (https://phabricator.wikimedia.org/T232857) (owner: 10Fdans) [08:54:34] (03PS1) 10Fdans: Add mediarequests per referer to fake data script [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/537601 (https://phabricator.wikimedia.org/T232857) [09:39:00] 10Analytics, 10User-Elukey: Show IPs matching a list of IP subnets in Webrequest data - https://phabricator.wikimedia.org/T220639 (10elukey) @ayounsi should we keep this task open? [10:09:48] * elukey errand + lunch! [10:21:00] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10MGerlach) 05Resolved→03Open Thanks, I can ssh into production servers. However, I cannot access SWAP following [[ https://wikitech.wikimedia.org/wiki/... [12:02:08] (03CR) 10Joal: [C: 03+2] "The test job launched by @mforns succeeded. Merging!" 
[analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/528504 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns) [12:04:49] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for today's deploy" [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/537601 (https://phabricator.wikimedia.org/T232857) (owner: 10Fdans) [12:05:37] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for today's deploy" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/537114 (https://phabricator.wikimedia.org/T232857) (owner: 10Fdans) [12:06:18] (03Merged) 10jenkins-bot: Add spark job to create mediawiki history dumps [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/528504 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns) [12:06:37] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for today's deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/530002 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns) [12:06:43] thanks joal!! [12:07:05] no prob mforns :) [12:07:18] Preparing this evening full deploy - will release new jar now [12:09:48] (03PS1) 10Joal: Bump changelog to v0.0.100 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/537632 [12:09:57] mforns: --^ please if you have a minute :) [12:12:55] (03PS1) 10Joal: Bump webrequest-load jar version to v0.0.100 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/537633 (https://phabricator.wikimedia.org/T212854) [12:13:17] This one could be double-checked as well mforns :) [12:15:39] joal: o/ [12:15:44] Hi elukey :) [12:15:47] Siesta time ;) [12:15:48] did you see my lovely link about cloudera? [12:15:54] siestaaaaa [12:15:55] I did [12:16:17] elukey: will bigtop become a less than second class alternative? 
[12:16:38] elukey: I have info for you about notebookerbs [12:16:40] no idea, it depends by a lot of things, but it is a mess :( [12:16:49] it is indeed :( [12:20:18] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10elukey) @MGerlach done! Please take a moment to review https://wikitech.wikimedia.org/wiki/LDAP/Groups#wmf_group, since your account is now able to see a... [12:23:52] looks like mforns is gone - elukey would you mind? https://gerrit.wikimedia.org/r/537632 [12:24:54] elukey: notebookerbs work great with a kinit, except for hive access (I was kinda expecting that) [12:26:31] oh actually elukey - my superbad [12:26:32] joal: ah yes we'll probably need to fix that [12:26:39] oh no mmm [12:26:46] in theory if it doesn't use JDBC it should work [12:26:47] elukey: I think the problem was pbcak [12:26:47] in theory [12:27:13] (03CR) 10Elukey: [C: 03+1] "100!! We should celebrate!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/537632 (owner: 10Joal) [12:27:36] Maybe at some point we'll go to 0.1.0 :) [12:27:54] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/537632 (owner: 10Joal) [12:39:11] !log Release refinery-source v0.0.100 to archiva [12:39:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:49:46] (03PS2) 10Joal: Bump webrequest-load jar version to v0.0.100 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/537633 (https://phabricator.wikimedia.org/T212854) [12:50:17] elukey: if you don't mind please :) [12:50:21] --^ [12:50:33] all the failures in alerts@ are you joal? [12:50:51] (03CR) 10Elukey: [C: 03+1] Bump webrequest-load jar version to v0.0.100 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/537633 (https://phabricator.wikimedia.org/T212854) (owner: 10Joal) [12:51:03] elukey: not supposed to ! 
[12:51:09] also I have stuff to review/merge before deploying refinery :) [12:51:12] elukey: the jenkins ones yes, the other ones not [12:51:21] elukey: ack !!! [12:53:20] mmmm seems a problem while contacting the hive 2 server [12:53:31] yes [12:53:33] joal: ok for me to restart the webrequest job? [12:53:38] please [12:53:43] I was looking at those as well [12:54:39] !log re-run webrequest-load upload/text for hour 11 due to transient hive server socket failures [12:54:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:54:49] elukey: this is bizarre :( [12:55:26] elukey: hole in metrics as well at the time [12:55:42] no bueno [12:56:19] elukey: about notebookerbs I disconfirm me having problems - all seems working great :) [12:56:24] Sorry for bothering :) [12:56:31] Will update the page [12:56:42] I am super happy, thanks for the tests! [12:57:18] ● hive-server2.service - LSB: Hive Server2 [12:57:18] Loaded: loaded (/etc/init.d/hive-server2; generated; vendor preset: enabled) [12:57:21] Active: active (exited) since Wed 2019-09-18 12:38:09 UTC; 18min ago [12:57:24] 18min ago [12:57:32] Actually one thing is not great: no meaningfull error message when failing to start a kernel due to kerb errors [12:58:21] metastore also restarted [12:58:28] Thanks elukey [12:58:32] I really wonder :( [12:58:53] weird [12:59:27] Wow - weird error message on wikitech [12:59:42] wikitech? [13:25:49] elukey: pinged on ops chan, and created a task (https://phabricator.wikimedia.org/T233215) [13:26:14] elukey: I updated your page: [13:26:19] https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Hadoop_testing_cluster [13:26:40] at the end - some things could/should be better I guess - Particularly in reporting errors [13:26:57] elukey: let me know when ready for merge/deploy on refinery [13:27:47] so the code is https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/537255/ [13:27:53] it needs marcel's +1 [13:27:57] mforns: you there? 
:) [13:28:07] ah right elukey - forgot that [13:28:54] is it blocking you? [13:28:58] elukey: any idea of what caused the hive-server hiccup? [13:29:13] not really, didn't find anything in the logs [13:29:17] elukey: I was planning to move on deploying, in order to have only aqs later on [13:29:49] elukey: upload succeeded - I'm assuming text will as well [13:31:11] the host's metrics were not showing up any spike in cpu or memory [13:31:51] no oom killer [13:31:59] hive logs are not very indicative [13:32:23] and both server2 and metastore down [13:32:24] mmmm [13:33:02] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for later deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/537633 (https://phabricator.wikimedia.org/T212854) (owner: 10Joal) [13:33:31] (03PS2) 10Elukey: Force execution of all the (python) scripts under bin/ with python3 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/537255 (https://phabricator.wikimedia.org/T204735) [13:33:39] (03CR) 10Elukey: [V: 03+2 C: 03+2] Force execution of all the (python) scripts under bin/ with python3 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/537255 (https://phabricator.wikimedia.org/T204735) (owner: 10Elukey) [13:33:51] joal: worst that can happen is that we'll restore one script [13:33:54] please go ahead :) [13:34:02] ack !! [13:35:53] !log Deploying refinery using scap [13:35:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:41:26] ottomata: hello? [13:41:43] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10MGerlach) 05Open→03Resolved @elukey thanks, works now. Closing this taks. 
[13:41:58] !log Deploy refinery to hdfs [13:42:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:44:56] Gone back to kids [13:46:20] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) Thanks to the awesome work of @Jclark-ctr an-presto1001 and an-presto1003 are now reimaged, but an-p... [13:47:59] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) Proposed fix for asw2-b: ` delete interfaces interface-range cloud-hosts1-b-eqiad member xe-4/0/5 s... [13:48:49] hey elukey I'm back [13:48:59] hola! [13:49:18] I merged https://gerrit.wikimedia.org/r/537255 to unblock joal, buuut if you could triple check I'd be grateful :) [13:49:25] ok [13:49:51] <3 [13:50:44] 4/5 presto nodes ready [13:50:49] last one standing :D [13:52:17] elukey, looks good to me! [13:52:39] super :) [13:55:49] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10akosiaris) >>! In T225128#5503053, @elukey wrote: > Proposed fix for asw2-b: > > ` > delete interfaces inte... [14:02:56] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) Committed: ` elukey@asw2-b-eqiad# show | compare [edit interfaces interface-range vlan-cloud-hosts1... 
[14:07:42] last presto node reimaging \o/ [14:21:32] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 4 others: Decomission eventlogging-service-eventbus and clean up related configs and code - https://phabricator.wikimedia.org/T232122 (10Ottomata) Just ran ` apt-get purge python-zmq python-tornado python-ua-parser python-urllib3 py... [14:22:12] yay! [14:22:22] mgerlach: o/ [14:22:24] how goes?! [14:26:27] all presto nodes with buster [14:41:40] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) Ok so current status: * All hosts reimaged to buster and working * Renamed hostnames in netbox * Wai... [14:46:20] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) AWESOME thank youuuu [14:50:24] ottomata: o/ all good. thanks for your reply on the venv. [14:51:57] easier to start with the mediawiki-utilities but you are right that performance-wise I should look into the hive table [14:52:44] mgerlach: well, a nice thing about not using mwxml is...you don't need the xml, and you don't need to parse any exml [14:52:44] xml [14:53:31] ottomata: indeed, thats a big plus (although the tool takes away almost all the xml-awfulness) [14:56:06] mgerlach: https://wikitech.wikimedia.org/wiki/SWAP#sql_magic [14:56:42] haha although i just tried and got a big error [14:56:44] with sql_magic.... [14:57:26] !pip install sql_magic in noteebook helped firrst [14:58:17] ottomata: do you recommend running via notebooks? [14:58:27] depends on what you are doing [14:58:58] if you like jupyter notebooks go ahead...but they aren't the easiest things for us to maintain, so they often have problems... 
[14:59:49] you can do the same stuff from cli too (although not sql magic? not sure) [15:00:28] perfect. i will check it out (just got the access a few minutes ago ; ) [15:07:28] mgerlach: i changed the example at https://wikitech.wikimedia.org/wiki/SWAP#with_Hive_(MapReduce): to use mediawiki_wikitext_history [15:08:22] the main thing to be careful about on with notebooks is to not store big data on the local hard drives [15:08:27] they aren't huge, and that is what HDFS is for [15:08:49] ottomata: ok, got it. thanks [15:09:32] oo, also, an important thing about any hive table [15:09:32] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Queries#Use_partitions [15:09:47] not all tables have the same partitions fields [15:09:52] but if you show create table or describe the table [15:09:59] it will tell you which ones are partition fields [15:10:04] you can also do [15:10:11] show partitions to see what is avialable [15:10:23] although for some tables there will be a LOT of partitions [15:33:39] ottomata: ops sync? [15:33:57] oh ho [15:33:58] yes [15:36:18] PROBLEM - Check the last execution of refine_eventlogging_eventbus_job_queue on an-coord1001 is CRITICAL: NRPE: Command check_check_refine_eventlogging_eventbus_job_queue_status not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:55:54] ah this is due to the eventbus cleanup [15:55:57] np [15:57:40] i think that is a laggy icinga [15:57:41] 10Analytics: Upgrade eventlogging to Python 3 - https://phabricator.wikimedia.org/T233231 (10elukey) [15:57:43] running puppet on icinga [16:01:04] ping joal [16:01:11] 10Analytics, 10User-Elukey: Show IPs matching a list of IP subnets in Webrequest data - https://phabricator.wikimedia.org/T220639 (10ayounsi) 05Open→03Resolved All good here. Thanks! 
[16:18:00] PROBLEM - Check the last execution of refinery-drop-webrequest-raw-partitions on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-webrequest-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:18:37] elukey: this feels like a p3 related error --^ [16:19:06] buuuuu [16:19:07] :) [16:19:09] checking [16:19:17] Thanks :) [16:25:54] PROBLEM - Check the last execution of refinery-drop-apiaction-partitions on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-apiaction-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:26:26] PROBLEM - Check the last execution of refinery-drop-eventlogging-partitions on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-eventlogging-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:26:50] yes yes bad Luca [16:26:56] uffff [16:27:22] set some downtime [16:28:07] what happens elukey ? [16:28:14] deleted wrong files? 
[16:28:23] nono the script doesn't even run [16:28:36] I think it is a docopt issue [16:29:43] Mwarf :( [16:31:41] 10Analytics: add agent-type dimension to pageviews per country endpoint - https://phabricator.wikimedia.org/T233238 (10Nuria) [16:31:52] 10Analytics: add agent-type dimension to pageviews per country endpoint - https://phabricator.wikimedia.org/T233238 (10Nuria) p:05Triage→03Normal [16:32:46] PROBLEM - Check the last execution of refinery-drop-cirrussearchrequestset-partitions on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-cirrussearchrequestset-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:33:40] 10Analytics: add agent-type dimension to pageviews per country endpoint - https://phabricator.wikimedia.org/T233238 (10Nuria) We can, load this data into a "shadow" table, v2 in cassandra and after swap the current table by the other one. [16:35:40] PROBLEM - Check the last execution of refinery-drop-eventlogging-client-side-partitions on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-eventlogging-client-side-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:36:08] elukey: should I rollback the change? [16:36:56] joal: I am still testing, the rollback can be a manual oneliner on an-coord1001, will do it if I can't find the issue [16:37:07] k elukey [16:37:08] first was that python3-mock wasn't on an-coord1001 [16:37:11] just installed it [16:39:39] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 4 others: Decomission eventlogging-service-eventbus and clean up related configs and code - https://phabricator.wikimedia.org/T232122 (10Ottomata) Phew ok after all those patches, I think we are good with puppet code cleanup! 
[16:40:34] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 4 others: Decomission eventlogging-service-eventbus and clean up related configs and code - https://phabricator.wikimedia.org/T232122 (10Ottomata) [16:44:02] 10Analytics, 10Analytics-Kanban: Move the Analytics Refinery to Python 3 - https://phabricator.wikimedia.org/T204735 (10elukey) So it seems that the `refinery-drop-older-than` script has some issues with python3: 1) python3-mock was not installed, but no errors were shown due to `sys.stderr = open(os.devnull,... [16:44:32] PROBLEM - Check the last execution of camus-eventbus on an-coord1001 is CRITICAL: NRPE: Command check_check_camus-eventbus_status not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:45:01] !log manually set "#!/usr/bin/env python" for refinery-drop-older-than on an-coord1001 to restore functionality (minor bug encountered) [16:45:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:46:00] !log manually restarted the refinery-drop-older-than jobs [16:46:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:46:12] RECOVERY - Check the last execution of refinery-drop-eventlogging-client-side-partitions on an-coord1001 is OK: OK: Status of the systemd unit refinery-drop-eventlogging-client-side-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:46:37] joal: manually restored python2, everything works now.. will work with mforns on a fix, but I'd like to leave the rest running to see if we encounter more issues [16:46:40] if you are ok [16:47:00] elukey, can I help? 
[16:47:09] no problem for me - there might be some alarms at new day
[16:47:43] mforns: o/ - it is my bad, I was too confident with python3 :) - I am getting https://phabricator.wikimedia.org/T204735#5503751
[16:48:01] it is probably a quick change, but I guess that the script needs to be tested more with python 3
[16:49:02] RECOVERY - Check the last execution of refinery-drop-webrequest-raw-partitions on an-coord1001 is OK: OK: Status of the systemd unit refinery-drop-webrequest-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:49:28] elukey, ok, will change now
[16:50:54] mforns: nono I already rolled back manually, whenever you have time :)
[16:51:00] oh ok
[16:51:04] this bit hash_message = bytes(sorted(hash_args.items()))
[16:51:06] leads to
[16:51:13] TypeError: 'tuple' object cannot be interpreted as an integer
[16:51:19] yea... weird
[16:51:33] I don't see any tuple
[16:53:25] RECOVERY - Check the last execution of refinery-drop-cirrussearchrequestset-partitions on an-coord1001 is OK: OK: Status of the systemd unit refinery-drop-cirrussearchrequestset-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:54:04] Analytics, Analytics-Kanban: Move the Analytics Refinery to Python 3 - https://phabricator.wikimedia.org/T204735 (elukey) ` >>> hash_args = {'a': 'a', 'b': 'b'} >>> hash_message = bytes(sorted(hash_args.items())) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: 'tuple'...
[16:54:07] mforns: --^
[16:54:41] elukey, yea, items() generates a list of tuples
[16:56:12] so it seems that applying bytes() to [('a', 'a'), ('b', 'b')] differs in python3
[16:57:45] stepping afk 10 mins :)
[16:58:31] k
[17:10:02] RECOVERY - Check the last execution of refinery-drop-apiaction-partitions on an-coord1001 is OK: OK: Status of the systemd unit refinery-drop-apiaction-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:11:38] RECOVERY - Check the last execution of refinery-drop-eventlogging-partitions on an-coord1001 is OK: OK: Status of the systemd unit refinery-drop-eventlogging-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:21:01] Analytics, Analytics-Kanban: Move the Analytics Refinery to Python 3 - https://phabricator.wikimedia.org/T204735 (elukey) help(bytes) in python2 leads to help(str), meanwhile on python3: ` class bytes(object) | bytes(iterable_of_ints) -> bytes | bytes(string, encoding[, errors]) -> bytes | bytes(b...
[17:21:20] mforns: going afk for the day, tomorrow if you have time let's check --^
[17:21:32] (will read later)
[17:41:00] Analytics, Analytics-Kanban: Move the Analytics Refinery to Python 3 - https://phabricator.wikimedia.org/T204735 (mforns) Thanks Luca for the pastes. I changed a bit the syntax to adapt to python3. And tested that everything is ok :]. Luckily, the checksums match the ones generated by python2, so we won'...
[17:41:38] (PS1) Mforns: Fix python3 incompatibility in refinery-drop-older-than [analytics/refinery] - https://gerrit.wikimedia.org/r/537705 (https://phabricator.wikimedia.org/T204735)
[17:43:34] Analytics, DBA, Data-Services: Prepare and check storage layer for hi.wikisource - https://phabricator.wikimedia.org/T219374 (Urbanecm) a:Marostegui Database was created.
[18:02:16] nuria: are you joining?
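Note on the bytes() failure discussed above: in Python 2, bytes is an alias of str, so bytes(sorted(d.items())) returns the printed form of the list. In Python 3, bytes(iterable) expects integers, hence the TypeError on the first tuple. The sketch below shows one way to get the same message under Python 3. It is not necessarily what the patch in https://gerrit.wikimedia.org/r/537705 does; the function names are hypothetical, and sha256 is an assumption, since the chat does not say which digest the script uses.

```python
import hashlib


def hash_message_py3(hash_args):
    """Build a deterministic bytes message from a dict of arguments.

    Hypothetical helper, not taken from refinery-drop-older-than.
    """
    # str() of the sorted items gives "[('a', 'a'), ('b', 'b')]", the same text
    # Python 2's bytes() (an alias of str) returned for plain str keys/values.
    return str(sorted(hash_args.items())).encode("utf-8")


def checksum(hash_args):
    # sha256 is an assumption; swap in whatever digest the real script uses.
    return hashlib.sha256(hash_message_py3(hash_args)).hexdigest()


if __name__ == "__main__":
    args = {"b": "b", "a": "a"}
    try:
        bytes(sorted(args.items()))  # the original expression
    except TypeError as err:
        print("python3 rejects it:", err)
    print(hash_message_py3(args))
    print(checksum(args))
```

Sorting the items first keeps the checksum independent of the order in which the arguments were passed. If any value were a unicode object under Python 2, its printed form would carry a u'' prefix and the two versions would no longer agree.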
[18:02:34] leila: ah, yes, sorry [18:09:53] !log Restart eventlogging with new ua-parser (ottomata did) [18:09:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:10:42] !log Kill-restart webrequest-load oozie job to pick-up new ua-parser [18:10:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:16:22] !log Start mediawiki-history-dumps oozie job starting with August 2019 [18:16:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:19:33] so far it seems no moar python failures right? [18:19:59] gooood [18:40:29] (03PS1) 10Joal: Update aqs to 3df76ab [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/537721 [18:40:59] fdans: if you're still here, would you mind? --^ [18:42:30] if not, I'll do it myself (aqs latest commit-sha matches the one in the above aqs-deploy patch, which contains only node_modules changes in addition to the src one) [18:44:31] ok - merging for deploy myself :) [18:44:45] ottomata: are you nearby (in case, I'm deploying AQS soon) [18:45:36] joal: 8 mins ok? [18:45:43] np ottomata - Merging and preping [18:47:30] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/537721 (owner: 10Joal) [18:51:09] Ok - new user-agent is deployed :) [18:51:52] Updating webrequest and pageview doc with high-level doc, will provide details tomorrow on dedicated page [18:53:13] k here [18:53:25] Ok ottomata - Everything is ready, deploying :) [18:53:28] k [18:53:33] !log Deploy AQS using scap [18:53:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:55:07] Mwarf -faile [18:56:52] oh ? [18:56:53] fail how? 
[18:57:03] problem of automated testing [18:57:20] I had inserted the data, so I don't get it [18:58:06] Ok I get it - Will correct this now [18:58:36] I should have been more concentrated at CR [18:59:29] !log Deploy AQS using scap - Try 2 [18:59:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:01:24] ok - second fail [19:02:11] joal: am here if you want me to help, let me know [19:02:30] Thanks ottomata - not ops related I think [19:03:44] Very much ... [19:04:03] ok - Will provide a patch and redeploy [19:05:25] (03PS1) 10Joal: Fix mediarequest per referer sample request [analytics/aqs] - 10https://gerrit.wikimedia.org/r/537728 [19:05:33] ottomata: --^ if you don't mind [19:05:47] ottomata: maybe more details in commit message I guess [19:10:07] ottomata: ping? [19:10:13] joal: ya [19:10:15] sorry [19:10:26] Thanks :) [19:10:28] k [19:10:32] (03CR) 10Ottomata: [C: 03+2] Fix mediarequest per referer sample request [analytics/aqs] - 10https://gerrit.wikimedia.org/r/537728 (owner: 10Joal) [19:13:00] (03PS1) 10Joal: Update aqs to 7a89363 [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/537731 [19:13:56] brb getting coffee [19:14:07] sure ottomata [19:14:13] Will merge and prep for deploy [19:14:33] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging fix for deploy" [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/537731 (owner: 10Joal) [19:20:48] 10Analytics, 10Operations, 10Traffic: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. 
x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10herron) p:05Triage→03Normal [19:23:22] bavck [19:23:31] k ottomata - deploying :) [19:23:41] k [19:23:50] !log Deploy AQS using scap - Try 3 [19:23:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:24:27] here we go - successfull canary [19:27:28] ok deploy successful :) [19:27:38] great! [19:27:41] Thanks ottomata for the supervising :) [19:27:47] i did a great job [19:27:47] :p [19:27:56] wikileader! [19:28:23] ok - gone for tonight, deploy is finished :) [19:28:34] ok, thanks joal, laterrrsss! [19:28:36] more docs on ua-parser tomorrow (basics are here) [20:36:35] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix python3 incompatibility in refinery-drop-older-than [analytics/refinery] - 10https://gerrit.wikimedia.org/r/537705 (https://phabricator.wikimedia.org/T204735) (owner: 10Mforns) [23:17:11] 10Analytics, 10Anti-Harassment, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for AHT - https://phabricator.wikimedia.org/T226853 (10nettrom_WMF) 05Open→03Resolved [[ https://meta.wikimedia.org/wiki/Schema:AutoblockIpBlock | Schema:AutoblockIpBlock... [23:17:13] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list - https://phabricator.wikimedia.org/T220410 (10nettrom_WMF) [23:20:17] 10Analytics, 10Community-Tech, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Community Tech - https://phabricator.wikimedia.org/T226861 (10nettrom_WMF) @ifried or @aezell : not sure which one of you to contact, so I'm pinging you both, sorry! Would... 
[23:24:16] Analytics, Analytics-Kanban, Patch-For-Review: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (Nuria) note to self, turnilo needs double quotes and no spaces: {"49":"ACK+URG+FIN","48":"ACK+URG","40":"PSH+URG","36":"RST+URG","34":"SYN+URG","33":"FIN+...
[23:31:29] Analytics, Analytics-Kanban, Patch-For-Review: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (Nuria) {F30393170}
[23:34:32] Analytics, Analytics-Kanban, Patch-For-Review: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (ayounsi) LGTM! Thanks!
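Note on the TCP flags map from T229682: a lookup like the one pasted in the task could be generated instead of typed by hand. The sketch below is only an illustration. The bit values come from the map quoted in the task; the helper names are hypothetical, and the order of names inside a combined label (highest bit first) is an assumption that does not match every label quoted in the task (for example, 49 comes out as URG+ACK+FIN here). The compact JSON output uses double quotes and no spaces, which is what the note about turnilo asks for.

```python
import json

# Bit values taken from the map quoted in the task discussion.
TCP_FLAGS = {32: "URG", 16: "ACK", 8: "PSH", 4: "RST", 2: "SYN", 1: "FIN"}


def flag_label(value):
    """Return a '+'-joined label for a TCP flags bitmask (hypothetical helper)."""
    # Dicts keep insertion order (Python 3.7+), so names come out highest bit first.
    names = [name for bit, name in TCP_FLAGS.items() if value & bit]
    return "+".join(names)


def build_lookup(max_value=63):
    """Map every non-zero 6-bit flag value (as a string key) to its label."""
    return {str(v): flag_label(v) for v in range(1, max_value + 1)}


def to_compact_json(mapping):
    # separators=(",", ":") drops the spaces json.dumps inserts by default;
    # JSON strings are always double-quoted.
    return json.dumps(mapping, separators=(",", ":"))


if __name__ == "__main__":
    print(to_compact_json(build_lookup()))
```

Generating all 63 combinations avoids nulls for rare flag sets, at the cost of a longer map than the hand-picked one.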