[00:46:50] 10Analytics, 10Privacy Engineering, 10Research, 10Patch-For-Review: Release dataset on top search engine referrers by country, device, and language - https://phabricator.wikimedia.org/T270140 (10bmansurov) @Isaac thanks for the link. I've been working [[ https://gerrit.wikimedia.org/r/c/analytics/refinery/... [03:13:49] RECOVERY - Yarn Nodemanagers in unhealthy status on an-worker1118 is OK: (C)3 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-backup-hadoop&orgId=1&panelId=46&fullscreen [06:46:32] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "Does this need to be merged in sync with another patch?" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/659291 (owner: 10Awight) [07:23:18] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Segment CodeMirror metrics by user edit count (031 comment) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/656210 (https://phabricator.wikimedia.org/T273471) (owner: 10Awight) [07:30:28] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Use the edit count bucket sent by TemplateData (033 comments) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/659227 (https://phabricator.wikimedia.org/T272569) (owner: 10Andrew-WMDE) [07:52:15] 10Analytics: Presto should warn or prevent users from querying without Hive partition predicates - https://phabricator.wikimedia.org/T273004 (10JAllemandou) Thanks @Ottomata for the fast answer :) Something to note: we currently don't have homogeneous partitioning strategies in term of datasize. What this means... [07:52:37] Good morning [07:57:06] 10Analytics: Druid loading of navigationtiming gets stuck - https://phabricator.wikimedia.org/T273216 (10elukey) In the overlord logs on an-druid1002 I can see this interesting set of logs, than then repeat over and over: ` /var/log/druid/overlord.2.log:2021-01-21T18:00:40,152 INFO org.apache.druid.indexing.ove... 
[07:57:19] Analytics: Druid loading of navigationtiming gets stuck - https://phabricator.wikimedia.org/T273216 (elukey) p:Unbreak!→Medium
[07:57:58] bonjour
[08:05:30] I am wondering how we should structure the airflow deployment
[08:05:48] the Discovery team has a repo only for Airflow's python wheels etc..
[08:06:00] (that I'd like to generalize into "analytics/airflow")
[08:06:10] plus their DAGs in wikimedia/discovery
[08:06:18] will we have our DAGs in refinery?
[08:06:37] (I'd need to create gerrit repos etc.. this is why I am asking)
[08:20:07] Analytics-Radar, SRE, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (elukey) @herron in T255973 @razzi is moving partitions to new Kafka Jumbo brokers, and the...
[08:35:56] joal: I am going to decom an-worker1117 from the main cluster
[08:36:04] and after replication is done, I'll move it to backup
[08:36:06] thanks elukey :)
[08:36:14] elukey: I know you know :)
[08:39:44] Analytics-Clusters, SRE: rsyslog segfault on an-test-presto1001 - https://phabricator.wikimedia.org/T273412 (fgiunchedi) Thank you @elukey, I don't remember this issue being reported, did the reimage go as expected ? If there are other similar hosts to be reimaged/installed we should definitely keep an e...
[08:40:31] Analytics-Clusters, SRE: rsyslog segfault on an-test-presto1001 - https://phabricator.wikimedia.org/T273412 (elukey) >>! In T273412#6795218, @fgiunchedi wrote: > Thank you @elukey, I don't remember this issue being reported, did the reimage go as expected ? If there are other similar hosts to be reimaged...
[08:42:21] !log decommission an-worker1117 from the Hadoop cluster, to move it under the Backup cluster
[08:42:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:43:27] replication started!
[08:43:31] it will take a bit [08:43:41] ack elukey [08:43:43] thanks a lot [08:43:50] joal: are 48T enough or do we need more? [08:44:53] elukey: I think it'll be just enough [08:48:28] 10Analytics-Clusters, 10SRE: rsyslog segfault on an-test-presto1001 - https://phabricator.wikimedia.org/T273412 (10elukey) 05Open→03Resolved a:03elukey After a chat with Filippo we concluded that the issue was originated due to the temporary root partition being full (it happened for a bit due to presto... [09:00:44] (03CR) 10Awight: [C: 03+1] Use the edit count bucket sent by TemplateData (033 comments) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/659227 (https://phabricator.wikimedia.org/T272569) (owner: 10Andrew-WMDE) [09:04:35] (03CR) 10Awight: Segment CodeMirror metrics by user edit count (031 comment) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/656210 (https://phabricator.wikimedia.org/T273471) (owner: 10Awight) [09:05:29] (03CR) 10Awight: "> Does this need to be merged in sync with another patch?" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/659291 (owner: 10Awight) [09:43:42] (03CR) 10ZPapierski: [C: 03+1] "LGTM" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/647723 (https://phabricator.wikimedia.org/T269619) (owner: 10DCausse) [10:23:19] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) I have followed https://wikitech.wikimedia.org/wiki/LVS#Add_a_new... 
[10:41:54] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for later deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/660829 (https://phabricator.wikimedia.org/T273457) (owner: 10Gerrit maintenance bot) [10:47:19] (03PS2) 10Joal: Add mni.wikipedia to pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/660833 (https://phabricator.wikimedia.org/T273456) (owner: 10Gerrit maintenance bot) [10:52:00] (03PS3) 10Joal: Add mni.wikipedia to pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/660833 (https://phabricator.wikimedia.org/T273456) (owner: 10Gerrit maintenance bot) [10:52:23] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for later deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/660833 (https://phabricator.wikimedia.org/T273456) (owner: 10Gerrit maintenance bot) [11:18:52] joal: https://pasqal.io/2021/01/27/pasqal-and-cineca-to-advance-use-cases-of-neutral-atoms-based-quantum-computers/ [11:24:04] PROBLEM - HDFS Namenode RPC 8020 call queue length on an-master1001 is CRITICAL: 988 ge 20 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen [11:24:14] ahahahha whattt [11:24:31] oooopsy [11:24:50] seems an-worker1118 [11:24:54] * elukey stares at joal [11:24:59] :D [11:25:07] * joal hides gently [11:25:34] elukey: shall I kill on job? 
[11:27:15] joal: yes please, we are also replicating 3M blocks from an-worker1117, it should be ok but if possible I'd tune it a little down [11:27:25] ack [11:27:40] thanks <3 [11:28:17] load reduced by half [11:29:29] joal: let's keep the queue length monitored, if it is too much it will slow down other things in the cluster as well :( [11:30:12] it is very interesting to compare RPC calls vs queue len [11:30:13] https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&var-hadoop_cluster=analytics-hadoop [11:30:56] 3 hours ago we reached the same amount of RPC volume, but it didn't cause a queue [11:31:33] but now we have the under replicated blocks [11:31:52] I think this is the thing elukey [11:34:54] joal: the call queue is still high :( [11:35:10] elukey: killing the other job [11:35:18] thanks a lot, sorry :( [11:35:23] no prob [11:35:27] done [11:35:58] RECOVERY - HDFS Namenode RPC 8020 call queue length on an-master1001 is OK: (C)20 ge (W)10 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen [11:36:09] mforns: Just a heads-up, we've been creating a review backlog for you in reportupdater-queries... Please let me know if there's anything I can do to provide background info, etc. I'm not sure how to make these more visible, they're scattered across several tasks. 
[11:41:44] elukey: my jobs definitely were the cause of the RPC queue - I'm sorry for that
[11:41:55] (PS5) Awight: Added visual editor sessions [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/659230 (https://phabricator.wikimedia.org/T271902) (owner: Svantje Lilienthal)
[11:42:13] (CR) Awight: "PS 5: compensate for sampling" [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/659230 (https://phabricator.wikimedia.org/T271902) (owner: Svantje Lilienthal)
[11:42:24] joal: nono I think it was a good learning experience, I am also going to add a specific runbook attached to the icinga alert
[11:42:26] elukey: and, it actually makes sense that THAT job made it happen
[11:42:30] so people know etc..
[11:42:51] elukey: the one I killed last was about event_sanitized
[11:43:28] elukey: MANY folders and small files, already copied
[11:44:12] joal: if even Joseph didn't realize it could have been a problem, nobody else would have for sure, so don't stress too much on it :D
[11:44:43] elukey: I was actually wondering when something like that would happen - I was monitoring, but not close enough :)
[11:45:05] (my quality bar threshold for a job well done and executed is called THE JOSEPH)
[11:45:17] huhu
[11:45:21] * joal blushes
[11:47:14] I'm confused: VisualEditorFeatureUse is sampled at 1/16, but AIUI can also include oversampled events. However, there is no column to indicate oversampling. How do I remove these events?
[11:47:19] elukey: when do you plan to add 1117 to backup?
[11:47:57] awight: I'm sorry I really have no idea :(
[11:48:16] awight: you'll have to wait for mforns or ottomata :(
[11:48:38] +1 will do!
[11:49:16] joal: I'd like to let the replication of the blocks complete, should finish in some hours (but in theory we could do it earlier)
[11:49:21] so I'd say tomorrow morning, too late?
[11:49:41] 10Analytics-Clusters, 10Analytics-Kanban: WMF-Last-Access cookie breaks Java client - https://phabricator.wikimedia.org/T98396 (10hashar) The same happens on Gerrit which uses `org.apache.httpcomponents:httpclient:4.5.2`. It does not recognizes the `expires` value. Filed as T273605 [11:51:50] no problem elukey - it'll be fine [11:52:11] elukey: may I restart a disctp job and monitor the queue? [11:52:43] joal: sure, but way more gentler please :) [11:52:44] or do you prefer me to wait for re-duplication end? [11:52:57] for sure elukey - gentler [12:02:41] 10Analytics-Radar, 10WMDE-Templates-FocusArea, 10Patch-For-Review, 10WMDE-TechWish (Sprint-2021-01-20): Compensate for sampling - https://phabricator.wikimedia.org/T273454 (10awight) [12:03:37] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#HDFS_Namenode_RPC_length_queue_alerts [12:03:39] (03PS1) 10Awight: Compensate for sampling [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/661108 (https://phabricator.wikimedia.org/T273454) [12:05:24] 10Analytics-Radar, 10WMDE-Templates-FocusArea, 10Patch-For-Review, 10WMDE-TechWish (Sprint-2021-01-20): Compensate for sampling - https://phabricator.wikimedia.org/T273454 (10awight) a:05awight→03None [12:05:28] maybe it is better a specific /Alerts page [12:06:09] 10Analytics-Radar, 10WMDE-Templates-FocusArea, 10Patch-For-Review, 10WMDE-TechWish (Sprint-2021-01-20): Adjust edit count bucketing for TemplateWizard, segment all metrics - https://phabricator.wikimedia.org/T273475 (10awight) [12:06:12] 10Analytics-Radar, 10WMDE-Templates-FocusArea, 10MW-1.36-notes (1.36.0-wmf.29; 2021-02-02), 10Patch-For-Review, 10WMDE-TechWish (Sprint-2021-01-20): Adjust edit count bucketing for CodeMirror - https://phabricator.wikimedia.org/T273471 (10awight) [12:06:22] 10Analytics-Radar, 10WMDE-Templates-FocusArea, 10MW-1.36-notes (1.36.0-wmf.29; 2021-02-02), 10WMDE-TechWish (Sprint-2021-01-20): Adjust edit count bucketing for 
VisualEditor, segment all metrics - https://phabricator.wikimedia.org/T273474 (10awight) [12:06:34] awesome page elukey [12:06:36] 10Analytics-Radar, 10WMDE-Templates-FocusArea, 10MW-1.36-notes (1.36.0-wmf.29; 2021-02-02), 10Patch-For-Review, 10WMDE-TechWish (Sprint-2021-01-20): Add edit count bucketing to all metrics - https://phabricator.wikimedia.org/T269986 (10awight) [12:08:46] joal: do you think that we should have Hadoop/Alerts, and then collect stuff in there? [12:08:51] as opposed to /Administration [12:09:12] elukey: yes! [12:09:19] makes a lot of sense [12:09:28] perfect, I'll start the page with this one, and then gradually add all the rest [12:09:34] <3 [12:16:18] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue_alerts [12:16:38] elukey: the RPC call queue is not empty - is the current number too much? [12:17:26] PROBLEM - HDFS Namenode RPC 8020 call queue length on an-master1001 is CRITICAL: 91 ge 20 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen [12:17:28] elukey: I have reduced the number of parallel workers heavily [12:17:34] ok - too much [12:17:38] killing the job [12:17:52] yes yes this is a good point, I think that the replication of blocks is heavy enough [12:18:01] right [12:18:37] elukey: actually that's interesting: before the block-rep, I had a jobs with a lot of workers, high number of calls, but no failure [12:19:00] yeah [12:19:08] elukey: I think the block-rep calls might be longer than the ones my jobs do, leading to bottleneck [12:19:21] I'll be patient and wait for your sign elukey :) [12:23:56] RECOVERY - HDFS Namenode RPC 8020 call queue length on an-master1001 is OK: (C)20 ge (W)10 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration 
https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen
[12:28:54] joal: I just remembered some work that I wanted to do but never really had the time, namely what is indicated in https://community.cloudera.com/t5/Community-Articles/Scaling-the-HDFS-NameNode-part-1/ta-p/246683
[12:29:18] there is an option to set a dedicated service port for datanodes comms
[12:29:49] that'd be awesome elukey - makes a lot of sense
[12:30:12] and also dfs.namenode.audit.log.async, it feels like something that we already discussed, maybe it was not compatible with our hadoop version
[12:30:19] also elukey - I noticed that CPU wise, the NN is never very busy - could we grow the number of threads for RPC calls?
[12:30:42] could be an option yes
[12:31:13] ah yes we already use dfs.namenode.audit.log.async
[12:31:15] past Luca did it
[12:31:16] :D
[12:32:17] :)
[12:33:24] and I think I never enabled the dedicated service port because it is complex
[12:33:35] we could do it when the cluster is down in theory
[12:33:39] I can test it during these days
[12:33:45] great :)
[12:50:23] I am going to answer the alert emails after lunch!
[12:50:24] ttl!
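The Icinga messages for the NameNode RPC call queue report their thresholds inline ("988 ge 20", "91 ge 20", "(C)20 ge (W)10 ge 0"), i.e. critical at 20 and warning at 10. A minimal Python sketch of that classification, assuming plain greater-or-equal comparisons; the function name is hypothetical and this is not the actual check implementation:

```python
def classify_queue_length(length: int, warning: int = 10, critical: int = 20) -> str:
    """Map an RPC call queue length to an alert state.

    Thresholds default to the values shown in the alert text
    ("(C)20 ge (W)10"); the comparison logic itself is an assumption.
    """
    if length >= critical:
        return "CRITICAL"
    if length >= warning:
        return "WARNING"
    return "OK"


# 988 and 91 are the queue lengths reported in the two PROBLEM alerts;
# 0 is an illustrative idle value.
for observed in (988, 91, 0):
    print(observed, classify_queue_length(observed))
```

Both reported values sit well above the critical threshold, which matches the decision to kill the copy jobs while block replication was running.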
[13:04:06] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Add client_port and is_debug to webrequest 128 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/660031 (https://phabricator.wikimedia.org/T273083) (owner: 10Milimetric) [13:05:12] elukey: (when you're back) just checking that the refinery-source release you do today is 0.1.0 [13:08:10] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Bump refinery-source version to 0.1.0 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/660836 (https://phabricator.wikimedia.org/T273083) (owner: 10Milimetric) [13:09:04] fdans / mforns: would appreciate a review on https://gerrit.wikimedia.org/r/c/analytics/wikistats2/+/660835/ so we can stop thinking about wikistats as much as possible :) [13:10:53] 10Analytics-Data-Quality, 10VisualEditor, 10WMDE-TechWish: Investigate missing dialog close events - https://phabricator.wikimedia.org/T272020 (10awight) We checked whether `dialog-remove` or `dialog-insert` events might explain the missing closes, but they never appear for feature `transclusion`. [13:48:33] 10Analytics: Presto should warn or prevent users from querying without Hive partition predicates - https://phabricator.wikimedia.org/T273004 (10Ottomata) > One way to go about this may be to use hive.max-partitions-per-scan In lieu of better option, this seems to be better than nothing. How about setting this... [13:51:32] (03CR) 10WMDE-Fisch: [C: 03+1] Added visual editor sessions [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/659230 (https://phabricator.wikimedia.org/T271902) (owner: 10Svantje Lilienthal) [13:57:10] (03CR) 10WMDE-Fisch: Use edit count bucket sent by TemplateWizard (032 comments) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/657634 (https://phabricator.wikimedia.org/T273475) (owner: 10Awight) [14:04:10] (03CR) 10WMDE-Fisch: "Should we wait for the VE parts to be merged to exclude the oversamples? So far this looks fine to me." 
[analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/661108 (https://phabricator.wikimedia.org/T273454) (owner: 10Awight) [14:05:35] (03CR) 10WMDE-Fisch: Compensate for sampling (031 comment) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/661108 (https://phabricator.wikimedia.org/T273454) (owner: 10Awight) [14:14:31] milimetric: gotcha! will review [14:15:34] awight, just joined, I will look at (at least) some of your patches today, thanks for the ping [14:15:49] mforns: thanks! [14:53:00] 10Analytics, 10FR-Tech-Analytics, 10Fundraising-Backlog: Whitelist Portal and WikipediaApp event data for (sanitized) long-term storage - https://phabricator.wikimedia.org/T273246 (10mforns) Hi @EYener! Please, add the schemas (and fields) that you want to be kept indefinitely to the include-list in the [[... [15:15:14] (03CR) 10Mforns: [C: 04-1] "I think there's a config typo: comment inline." (031 comment) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/659230 (https://phabricator.wikimedia.org/T271902) (owner: 10Svantje Lilienthal) [15:22:33] 10Analytics: HDFS Namenode: use a separate port for Block Reports and Zookeeper failover - https://phabricator.wikimedia.org/T273629 (10elukey) [15:27:21] (03CR) 10Mforns: Use edit count bucket sent by TemplateWizard (031 comment) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/657634 (https://phabricator.wikimedia.org/T273475) (owner: 10Awight) [15:30:22] (03PS5) 10Awight: Use edit count bucket sent by TemplateWizard [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/657634 (https://phabricator.wikimedia.org/T273475) [15:30:40] (03CR) 10Awight: Use edit count bucket sent by TemplateWizard (032 comments) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/657634 (https://phabricator.wikimedia.org/T273475) (owner: 10Awight) [15:31:26] (03CR) 10WMDE-Fisch: [C: 03+1] Use edit count bucket sent by TemplateWizard 
[analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/657634 (https://phabricator.wikimedia.org/T273475) (owner: 10Awight) [15:32:42] milimetric: looking at the cr :) [15:33:32] (03CR) 10Awight: "PS 6: Fix column name" (031 comment) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/659230 (https://phabricator.wikimedia.org/T271902) (owner: 10Svantje Lilienthal) [15:33:36] (03PS6) 10Awight: Added visual editor sessions [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/659230 (https://phabricator.wikimedia.org/T271902) (owner: 10Svantje Lilienthal) [15:33:51] (03CR) 10WMDE-Fisch: [C: 03+1] "We can still improve with a follow up here." [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/661108 (https://phabricator.wikimedia.org/T273454) (owner: 10Awight) [15:36:35] (03CR) 10Awight: [C: 03+1] Added visual editor sessions [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/659230 (https://phabricator.wikimedia.org/T271902) (owner: 10Svantje Lilienthal) [15:39:07] (03CR) 10WMDE-Fisch: [C: 03+1] Added visual editor sessions [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/659230 (https://phabricator.wikimedia.org/T271902) (owner: 10Svantje Lilienthal) [15:46:52] (03CR) 10Mforns: Compensate for sampling (032 comments) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/661108 (https://phabricator.wikimedia.org/T273454) (owner: 10Awight) [15:50:48] I just applied a change to archiva, building source without any artifacts takes around 6:30 mins [15:50:54] (from stat1004) [15:51:06] the other day I tried and it was more than 10 mins [15:51:15] \o/ [15:56:29] (03CR) 10Awight: Compensate for sampling (032 comments) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/661108 (https://phabricator.wikimedia.org/T273454) (owner: 10Awight) [16:05:50] 10Analytics, 10SRE, 10Traffic: Downloading from Archiva.wikimedia.org seems slower than Maven 
Central - https://phabricator.wikimedia.org/T273086 (10elukey) @hashar I applied the nginx change to bypass Jetty, can you test again? [16:22:04] 10Analytics, 10SRE, 10Traffic: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (10hashar) Fetching https://archiva.wikimedia.org/repository/mirrored/junit/junit/4.13.1/junit-4.13.1.jar it still takes a while until the transfer starts: | time... [16:26:05] 10Analytics-Radar, 10SRE, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10herron) That's really exciting! Yes I'd love do see this happen as well, and am on board... [16:27:01] 10Analytics, 10SRE, 10Traffic: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (10elukey) I think that we should make tests inside the wikimedia network, testing from home is not reliable (as you said there are too many variables, one above a... [16:27:25] (03CR) 10Mforns: [C: 03+1] "LGTM! Code looks good, and I smoke-tested the UI. Consider this a +2 if you want to merge! Leaving +1 in case someone else wants to chime " [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/660835 (https://phabricator.wikimedia.org/T262725) (owner: 10Milimetric) [16:28:04] elukey: are we going to deployment-train today? [16:28:33] remember version 0.1.0!! [16:28:44] milimetric: :-) yes! [16:28:50] I have a meeting now but yes this is the plan :) [16:28:56] k! [16:29:00] (03CR) 10Milimetric: [C: 03+2] Remove metric groups [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/660835 (https://phabricator.wikimedia.org/T262725) (owner: 10Milimetric) [16:29:03] I want to see if archiva is faster, I applied a perf improvement today [16:29:04] ping me to pair if you want! [16:29:37] sure thing! 
[16:29:39] fyi fdans: merged that cr, wanted it on the train [16:30:22] (03Merged) 10jenkins-bot: Remove metric groups [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/660835 (https://phabricator.wikimedia.org/T262725) (owner: 10Milimetric) [16:35:45] 10Analytics: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (10Ottomata) [16:39:26] !log rebalance kafka partitions for eventlogging_InukaPageView [16:39:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:41:12] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "Looks almost like a copy of Ie5abb3f. A notable difference is that the data structure here contains a ….performer.… element." [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/657634 (https://phabricator.wikimedia.org/T273475) (owner: 10Awight) [16:52:23] 10Analytics, 10Analytics-EventLogging: Uncaught TypeError: navigator.sendBeacon is not a function - https://phabricator.wikimedia.org/T273374 (10Ottomata) @Amorymeltzer what is your browser / user agent? [17:01:07] 10Analytics, 10observability, 10User-fgiunchedi: Setup Analytics team in VO/splunk oncall - https://phabricator.wikimedia.org/T273064 (10Ottomata) [17:01:58] 10Analytics: Presto should warn or prevent users from querying without Hive partition predicates - https://phabricator.wikimedia.org/T273004 (10JAllemandou) > How about setting this to the number of hours in a 5 weeks Let's do it. While the span might be too small for cases where data is small and therefore cou... 
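For reference, the "number of hours in 5 weeks" value proposed for hive.max-partitions-per-scan can be worked out as below. This is a small Python sketch that assumes one partition per hour; the helper name is hypothetical:

```python
HOURS_PER_DAY = 24
DAYS_PER_WEEK = 7


def hourly_partitions(weeks: int) -> int:
    """Number of partitions covering `weeks` weeks, assuming hourly partitions."""
    return weeks * DAYS_PER_WEEK * HOURS_PER_DAY


print(hourly_partitions(5))  # 840
```

A query over an hourly-partitioned table could then scan at most five weeks of data before hitting the limit; tables partitioned at a coarser grain would get a much longer span from the same number.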
[17:14:24] (03CR) 10Milimetric: [C: 03+2] Fix broken month-subtracting logic [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/660908 (https://phabricator.wikimedia.org/T273470) (owner: 10Fdans) [17:16:18] (03Merged) 10jenkins-bot: Fix broken month-subtracting logic [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/660908 (https://phabricator.wikimedia.org/T273470) (owner: 10Fdans) [17:17:55] !log rebalance kafka partitions for eventlogging_CentralNoticeImpression [17:17:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:31:19] elukey: shall we send the email about moving to BigTop, or do you prefer we wait later? [17:41:04] joal: I wanted to make some final tests later on, but yes if you are ok I think we can send it :) [17:41:25] elukey: ok, let's do that together to go faster if you wish? [17:42:16] joal: sure [17:53:06] !log rebalance kafka partitions for eqiad.wdqs-external.sparql-query [17:53:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:03:00] 10Quarry, 10Cloud-Services, 10Cloud-VPS, 10SRE, and 3 others: Quarry should be HTTPS-only - https://phabricator.wikimedia.org/T107627 (10bd808) [18:17:16] mforns: do you have a min? [18:17:25] elukey: yes [18:17:27] bc? [18:17:30] (03PS1) 10Elukey: Update changelog for the 0.1.0 release [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/661164 [18:17:57] nah here is fine.. I am wondering if I need to do something for the 0.1.0 release, since maven added the 0.147 snapshot in a precedent commit [18:18:10] besides the changelog update [18:18:36] (03PS3) 10Milimetric: Upgrade to upstream version 1.29.0 [analytics/turnilo/deploy] - 10https://gerrit.wikimedia.org/r/655749 (https://phabricator.wikimedia.org/T233336) [18:18:43] hm... 
[18:19:15] from the docs it seems ok [18:19:30] "The Refinery source commit list and make sure that the changelog.md has been updated with the latest version and possibly that [maven-release-plugin] has committed the related version bump changes (the last step is optional, it could be triggered manually following the instruction below)." [18:19:42] elukey: wait, I'm still trying to understand what you said... :P [18:20:09] one sec, lookin [18:20:51] mforns: basically, git grep 0.147 [18:21:23] I just want to make sure that the pom changes that maven plugin did recently are ok [18:21:37] or if I have to modify also the poms [18:21:55] 10Analytics: Add time interval limits to pageview API - https://phabricator.wikimedia.org/T261681 (10lexnasser) @Milimetric Thanks for clarifying! Should have that completed soon. --- For the interval range restriction, I'm thinking ~1 year is a good limit, since that's what is already used for some other endp... [18:21:56] (03CR) 10Razzi: [V: 03+2 C: 03+2] Upgrade to upstream version 1.29.0 [analytics/turnilo/deploy] - 10https://gerrit.wikimedia.org/r/655749 (https://phabricator.wikimedia.org/T233336) (owner: 10Milimetric) [18:22:58] elukey: I see, the development iteration... [18:23:08] yea, hadn't thought about that [18:23:26] 10Analytics, 10Analytics-EventLogging: Uncaught TypeError: navigator.sendBeacon is not a function - https://phabricator.wikimedia.org/T273374 (10Amorymeltzer) @Ottomata Firefox 78, but I also get it in Safari; a quick test suggested I wasn't seeing it in Chrome? [18:23:50] milimetric: where are you deploying Turnilo? To the test node? [18:24:59] elukey: we meant to deploy to only an-tool1007 (staging) but it deployed to staging and production at the same time [18:25:22] yes because scap needs to run with --limit to deploy on one node :) [18:25:29] err wait 1005 is staging, 1007 is live [18:25:43] ah! 
[18:25:47] we'll update docs
[18:25:48] the hosts are both in the target list, but if you scap deploy it deploys to both
[18:26:05] yep, I didn't know about the limit, I was looking for something like -e test
[18:26:23] we can also add it as well, surely better
[18:26:38] elukey: I'm trying to get to the command that the jenkins job is running
[18:26:46] mforns: me too, thanks a lot!
[18:27:28] mforns: I am reading https://phabricator.wikimedia.org/T210271
[18:30:05] Analytics-Radar, SRE, ops-eqiad: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 (wiki_willy) a:Cmjohnson
[18:33:40] (CR) Thiemo Kreuz (WMDE): [C: +1] Added visual editor sessions (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/659230 (https://phabricator.wikimedia.org/T271902) (owner: Svantje Lilienthal)
[18:38:22] mforns: I see https://phabricator.wikimedia.org/T210271#5996192
[18:38:35] that seems to point out that maven should figure out the version
[18:39:04] yes, makes sense!
[18:39:34] elukey: but it will figure it out from the snapshot one instead of from the new one
[18:39:40] maybe
[18:40:12] mforns: yes yes I think as well that it will try to build 0.147, not 0.1.0 :(
[18:40:29] there is no override, maybe we should have one for the future
[18:40:53] well, but in this case, we can try and change the snapshot versions in the pom
[18:41:25] and trust that jenkins maven release will look at snapshot, and not at previous number
[18:41:25] yep yep! I am going to update the cr
[18:41:43] mforns: if anything goes sideways we blame Dan for the version choice
[18:41:57] although... it would be pretty logical that it would do a priorNumber+1 calculation...
[18:42:04] yes, of course
[18:42:08] :]
[18:45:26] (PS2) Elukey: Update changelog for the 0.1.0 release [analytics/refinery/source] - https://gerrit.wikimedia.org/r/661164
[18:45:57] mforns: does it make sense --6 ?
[18:46:00] --^?
[18:46:54] elukey: it does to me!
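The open question here is which rule the release job uses to pick the version. A minimal Python sketch of the two candidate rules, assuming the first one matches the usual maven-release-plugin default (release version = development version without its `-SNAPSHOT` suffix); whether the Jenkins job behaves this way is not confirmed in the log, and both function names and the `1.2.3` sample are hypothetical:

```python
SNAPSHOT_SUFFIX = "-SNAPSHOT"


def release_from_snapshot(dev_version: str) -> str:
    """Rule 1: derive the release version from the pom's snapshot version."""
    if not dev_version.endswith(SNAPSHOT_SUFFIX):
        raise ValueError(f"not a snapshot version: {dev_version!r}")
    return dev_version[: -len(SNAPSHOT_SUFFIX)]


def prior_plus_one(last_release: str) -> str:
    """Rule 2: bump the last numeric component of the previous release."""
    parts = last_release.split(".")
    parts[-1] = str(int(parts[-1]) + 1)
    return ".".join(parts)


print(release_from_snapshot("0.1.0-SNAPSHOT"))
print(prior_plus_one("1.2.3"))
```

Under rule 1, editing the snapshot versions in the pom files is enough to get a 0.1.0 release; under rule 2 the pom edit would be ignored, which is the risk mforns raises.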
[18:46:56] elukey: can you please update commit title to changelog + pom ?
[18:47:01] (CR) Mforns: [C: +1] Update changelog for the 0.1.0 release [analytics/refinery/source] - https://gerrit.wikimedia.org/r/661164 (owner: Elukey)
[18:49:15] joal: yep sure!
[18:49:19] thanks elukey
[18:52:44] (PS3) Elukey: Update changelog and pom.xml files for the 0.1.0 release [analytics/refinery/source] - https://gerrit.wikimedia.org/r/661164
[18:53:03] ottomata: o/ please let me know if you see any blockers on https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/654658 & https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/647723, would love to ship those relatively soon :)
[18:55:59] Analytics, Analytics-Kanban, Patch-For-Review: Add urlshortener button to Turnilo - https://phabricator.wikimedia.org/T233336 (razzi) a:razzi
[18:57:10] elukey: /me does a little dance, we got short urls baby!!!
[18:57:31] niceeeeeeeeeeee \o/
[18:57:50] easiest problem that took 10 engineers ever :)
[18:57:52] milimetric: if you join #wikimedia-sre and announce it you'll get a lot of love
[18:58:04] mforns: all right merging!
[18:58:11] elukey: I owe you some love, so you go ahead, take the credit
[18:58:22] elukey: cool!
[18:58:31] (CR) Elukey: [C: +2] Update changelog and pom.xml files for the 0.1.0 release [analytics/refinery/source] - https://gerrit.wikimedia.org/r/661164 (owner: Elukey)
[18:59:03] milimetric: I'll mention you and Razzi then
[19:01:52] elukey: I have manually tested hive, beeline, spark, and I have restarted webrequest oozie job with up-to-date version of the jar - all good
[19:02:10] yeeeesssssssssss
[19:02:15] * elukey sends wikilove to joal
[19:02:15] elukey: IIRC you mentioned refinery failures on test cluster - can you tell me more?
[19:02:16] Starting build #70 for job analytics-refinery-maven-release-docker
[19:02:18] milimetric: sorry to bother again, could you provide your thoughts on my latest comment: https://phabricator.wikimedia.org/T261681
[19:03:13] joal: two things
[19:03:16] 1) https://gerrit.wikimedia.org/r/c/operations/puppet/+/660619
[19:04:18] 2) https://gerrit.wikimedia.org/r/c/operations/puppet/+/660858 but it was more camus-related, I fixed it with Marcel
[19:04:34] I haven't seen alarms after the two changes
[19:04:42] Right I remember that elukey
[19:04:45] (camus was recording weird raw data dir)
[19:05:33] elukey: the meta.dt was a change we needed for main cluster as well, right?
[19:06:08] joal: yep exactly, it happened for the eventgate switch IIUC
[19:06:21] right, and you applied it to test as well
[19:07:07] exactly, even if I wasn't aware of it so me and Marcel had to track down the problem :D
[19:07:22] :S
[19:07:33] following you guys
[19:08:41] ok elukey - I confirm we have event data on hdfs - It seems all good for me
[19:08:43] while trying to find the code that decides the next release version for analytics-refinery-maven-release-docker
[19:09:01] I'm happy to say we can pull the trigger
[19:09:57] Project analytics-refinery-maven-release-docker build #70: FAILURE in 7 min 42 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/70/
[19:10:20] uou
[19:11:17] elukey: also - I started a distcp job with a lot less mappers - RPC-queue seems ok for now
[19:11:36] ack
[19:11:37] Failed to execute goal org.apache.maven.plugins:maven-deploy-plugin:2.7:deploy (default-deploy) on project refinery: Failed to deploy metadata: Could not transfer metadata org.wikimedia.analytics.refinery:refinery/maven-metadata.xml from/to archiva.releases (https://archiva.wikimedia.org/repository/releases/): Failed to transfer file:
[19:11:42] https://archiva.wikimedia.org/repository/releases/org/wikimedia/analytics/refinery/refinery/maven-metadata.xml. Return code is: 405, ReasonPhrase: Not Allowed.
[19:12:11] /o\
[19:12:15] I made a change to archiva earlier on, related to allowing nginx to serve files directly
[19:12:22] but I can access the xml
[19:12:23] this smells like password expiration :(
[19:12:29] ahhhhh
[19:12:35] right! Again??
[19:12:40] ?
[19:12:51] * razzi off to lunch
[19:13:03] joal: well it happens from time to time, sadly :(
[19:13:12] yeah
[19:13:13] also https://archiva.wikimedia.org/repository/releases/org/wikimedia/analytics/refinery/refinery/maven-metadata.xml is not protected
[19:14:07] also the HTTP 405 is method not allowed
[19:14:12] it is not a 403
[19:16:29] and I can't find the 405 in the access request log of archiva
[19:18:10] dcausse: one nit from zbyzko (but I +2ed and it looks like jenkins is merging it? not sure)
[19:18:31] (CR) Ottomata: [C: +2] Add rdf-streaming-updater schemas for side outputs [schemas/event/secondary] - https://gerrit.wikimedia.org/r/647723 (https://phabricator.wikimedia.org/T269619) (owner: DCausse)
[19:18:49] dcausse: shall I make a release of event utils?
[19:19:44] ottomata: that'd be great yes! :)
[19:20:08] (CR) Awight: [C: +1] Added visual editor sessions (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/659230 (https://phabricator.wikimedia.org/T271902) (owner: Svantje Lilienthal)
[19:20:22] I think it is related to my change, but I am not really sure
[19:21:00] trying to log in as archiva-ci
[19:22:19] ottomata: did we save the last archiva-ci password in the pwstore?
[19:22:31] there is a commit from me in May 2020
[19:22:35] was it the last time?
[19:23:40] ok I guess that I'll change the passzorz
[19:25:15] Analytics, Better Use Of Data, Event-Platform, Product-Data-Infrastructure, and 5 others: EventLogging PHP EventServiceClient should use EventBus->send(). - https://phabricator.wikimedia.org/T272863 (Ottomata) @hashar looks like that bug is fixed? What is needed to get this passing? Thank you!
[19:28:02] Starting build #71 for job analytics-refinery-maven-release-docker
[19:28:22] !log change archiva-ci password in pwstore, archiva and jenkins
[19:28:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:29:51] Analytics-Radar, WMDE-Templates-FocusArea, MW-1.36-notes (1.36.0-wmf.29; 2021-02-02), Patch-For-Review, WMDE-TechWish (Sprint-2021-01-20): Add edit count bucketing to all metrics - https://phabricator.wikimedia.org/T269986 (Ottomata) FYI, @awight https://meta.wikimedia.org/w/index.php?title=S...
[19:29:54] !log manually altered event.codemirrorusage to fix incompatible type change: https://phabricator.wikimedia.org/T269986#6797385
[19:29:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:33:27] Project analytics-refinery-maven-release-docker build #71: STILL FAILING in 5 min 24 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/71/
[19:33:37] nope
[19:34:50] ok I am going to step away for dinner, and I'll restart later, I need to check why archiva is so upset
[19:34:54] * elukey afk! dinner
[19:40:59] Gone for dinner as well - back after
[20:06:19] there you go
[20:06:20] archiva-ci [02/Feb/2021:19:33:25 +0000] "PUT /repository/releases/org/wikimedia/analytics/refinery/refinery/maven-metadata.xml HTTP/1.1" 405 173 "-" "Apache-Maven/3.5.2 (Java 1.8.0_265; Linux 4.9.0-14-amd64)
[20:06:42] the 405 is returned by nginx, since it is a PUT request
[20:07:56] so it is definitely related to my patch
[20:12:38] I am going to see if I can find a solution, otherwise I'll rollback
[20:12:45] Analytics, SRE, Traffic: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (hashar) From my connection something else is broken download a 2.13M [[ https://archiva.wikimedia.org/repository/releases/com/googlesource/gerrit/plugins/javame...
[20:17:07] Ah snap elukey :(
[20:24:23] Is there anything I can help with elukey ?
[20:28:28] joal: nono I am reverting my perf change, it was working fine but there are some corner cases that I need to study, like PUTs under /repository/etc..
[20:28:38] will kick off the new build in a bit
[20:28:41] ok
[20:28:49] sorry for the false lead on password elukey :(
[20:31:48] Starting build #72 for job analytics-refinery-maven-release-docker
[20:32:16] joal: it is fine!
[20:33:26] ok - Going off then - see you tomorrow team
[20:35:13] good night!
[20:39:00] !log rebalance kafka partitions for codfw.mediawiki.job.htmlCacheUpdate
[20:39:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:39:05] !log rebalance kafka partitions for eqiad.mediawiki.job.htmlCacheUpdate
[20:39:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:41:49] Yippee, build fixed!
[20:41:49] Project analytics-refinery-maven-release-docker build #72: FIXED in 10 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/72/
[20:43:37] Analytics, Better Use Of Data, Product-Data-Infrastructure: Define acceptable usage of the `meta` object in event schemas - https://phabricator.wikimedia.org/T273293 (Ottomata) We are having a fun little philosophical argument, eh? :) > IMO the data is the thing-being-measured I agree with this...
[20:44:06] ok so this was unexpected
[20:44:15] git tag --list | grep v0.1.
[20:44:15] v0.1.0
[20:44:15] v0.1.1
[20:44:15] v0.1.2
[20:44:34] so the two failed builds bumped up the version
[20:45:05] mforns: still there?
[20:45:11] elukey: yes!
[20:45:38] mforns: so today is not my deployment day I think :D
[20:46:01] the two failed builds bumped up the version, up to 0.1.2
[20:46:05] elukey: it's super late for you... do you want to do it tomorrow?
[20:46:40] yes probably it is good, I can send an email to the team about this versioning
[20:46:46] elukey: yes, is the bump up a problem? we can continue with 0.1.3 no?
[20:47:04] mforns: I think it is fine but I was wondering if everybody is ok :)
[20:47:27] sending an email just in case, will follow up tomorrow morning :)
[20:51:33] ok :]
[20:54:58] (PS1) Elukey: Update changelog.md with skipped releases [analytics/refinery/source] - https://gerrit.wikimedia.org/r/661187
[20:55:05] ah mforns I see that I should have updated the changelog with the skipped releases
[20:55:22] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Deploy/Refinery-source#If_the_maven_release_job_failed_(step_2)
[20:56:09] now the question is.. should I just merge the changelog.md update, and use 0.1.2 (that doesn't contain the last commit) or release 0.1.3?
[20:56:56] no problemo elukey, I think just committing and pushing the change is perfectly fine
[20:59:38] mforns: <3 ok then if nobody opposes I'll finish the deployment tomorrow morning, deploy refinery and restart jobs
[20:59:50] fine by me
[21:00:09] elukey: if you can wait until I join, I can pair with you
[21:00:33] mforns: if I get blocked for sure, but I hope to manage it!
[21:00:38] thanks a lot for the help <3
[21:00:40] ok
[21:02:02] have a good night!
[21:02:05] * elukey afk!
[21:02:26] Analytics, Privacy Engineering, Research, Patch-For-Review: Release dataset on top search engine referrers by country, device, and language - https://phabricator.wikimedia.org/T270140 (Isaac) > I've been working on this patch. This is the main file. Feel free to add comments to the patch with ne...
[21:17:52] Analytics, Analytics-Wikistats: Wikistats Bug: Bulgarian Language... - https://phabricator.wikimedia.org/T273677 (Altidore86)
[21:19:01] Analytics, Analytics-Wikistats, Language codes, Language-Team (Language-2020-Focus-Sprint): Wikistats New Feature: Bulgarian... - https://phabricator.wikimedia.org/T273678 (Altidore86)
[22:04:12] Analytics, Better Use Of Data, Event-Platform, Product-Infrastructure-Team-Backlog, Epic: Event Platform Client Libraries - https://phabricator.wikimedia.org/T228175 (Dbrant)
[22:07:47] Analytics: Turnilo "Display Druid query" gives "general error" - https://phabricator.wikimedia.org/T273685 (CDanis)
[22:07:58] Analytics: Turnilo "Display Druid query" gives "general error" - https://phabricator.wikimedia.org/T273685 (CDanis) p:Triage→Low
[22:13:46] Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (razzi)
[22:53:57] Starting build #5 for job wikimedia-event-utilities-maven-release-docker
[22:55:32] Project wikimedia-event-utilities-maven-release-docker build #5: SUCCESS in 1 min 34 sec: https://integration.wikimedia.org/ci/job/wikimedia-event-utilities-maven-release-docker/5/
[22:56:55] hey hey, I'm running into this error with superset/presto: "presto error: Hive table 'wmf.webrequest' is corrupt. The number of files in the directory (64) does not match the declared bucket count (256) for partition"
[22:58:44] ah, looks like that particular date / time combo is not working
[22:59:18] dcausse: https://archiva.wikimedia.org/repository/releases/org/wikimedia/eventutilities/1.0.3/
[22:59:53] tzatziki: that's strange, what date?
[23:00:00] ottomata: 2020-11-07
[23:01:33] Seems that it fails for everything before 2020-11-16 ?
[23:11:04] (brb)
[23:18:30] huh
[23:20:04] yeah that is strange! indeed !
[23:20:23] tzatziki: i'd file a bug about that, did something change on that day? not that I know of
[23:23:00] Analytics, Analytics-EventLogging, Analytics-Kanban: Uncaught TypeError: navigator.sendBeacon is not a function - https://phabricator.wikimedia.org/T273374 (Milimetric)
[23:23:48] ottomata: maybe it was a deleted partition since it was close to 90d, I don't know :)
[23:36:02] Analytics, Analytics-EventLogging, Analytics-Kanban: Uncaught TypeError: navigator.sendBeacon is not a function - https://phabricator.wikimedia.org/T273374 (Milimetric) @Amorymeltzer: I believe you, but something's not making sense. `navigator.sendBeacon` has been available in Firefox since v31, in...
[23:41:57] Analytics: Superset error: `Hive table 'wmf.webrequest' is corrupt. The number of files in the directory (64) does not match the declared bucket count (256) for partition` - https://phabricator.wikimedia.org/T273693 (jrbs)
[23:42:27] tzatziki: yeah but there is data there, as it says
[23:42:30] that's what i thought at first too
[23:42:59] It's extremely weird since 2020-11-16 works but 2020-11-15 does not
[23:46:08] i'll see if jupyter works, though I've not used that before