[00:21:28] RECOVERY - Check unit status of monitor_refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [03:58:19] 10Analytics: Move lexnasser's files before user deletion - https://phabricator.wikimedia.org/T280096 (10lexnasser) @elukey In stat1007, the directories I want to preserve are `/home/lexnasser/lexnasser-stat1007` and `/home/lexnasser/notebook1003`. In aqs-test1001, the directory I want to preserve is `/home/lexn... [05:06:24] 10Analytics, 10Dumps-Generation: Temp files left around in wikistats_1/ ? - https://phabricator.wikimedia.org/T280311 (10ArielGlenn) p:05Triage→03Medium [05:07:15] 10Analytics, 10Dumps-Generation: Temp files left around in wikistats_1/ ? - https://phabricator.wikimedia.org/T280311 (10ArielGlenn) @Ottomata I'm tagging you because you knew about the rsync at some point; if someone else would know better, please feel free to redirect me. Thanks! [05:31:02] 10Analytics, 10ops-eqiad: an-worker1100 disk swap required - https://phabricator.wikimedia.org/T280313 (10elukey) [05:35:11] 10Analytics, 10ops-eqiad: an-worker1100 disk swap required - https://phabricator.wikimedia.org/T280313 (10elukey) [05:40:33] good morning! [05:40:45] going to decom analytics-tool1001 \o/ (old hue host) [05:54:45] 10Analytics, 10Patch-For-Review: Decommission analytics-tool1001 and all the CDH leftovers - https://phabricator.wikimedia.org/T280262 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: `analytics-tool1001.eqiad.wmnet` - analytics-tool1001.eqiad.wmnet (**PASS**) -... [05:55:11] 10Analytics, 10Patch-For-Review: Decommission analytics-tool1001 and all the CDH leftovers - https://phabricator.wikimedia.org/T280262 (10elukey) Removed also the hue keytab from krb1001 and puppetmaster1001 [05:55:29] Py2 -> Py3 migration completed! 
[05:55:39] wow I can finally close the task [05:56:21] 10Analytics-Kanban: Deprecate Python 2 software from the Analytics infrastructure - https://phabricator.wikimedia.org/T204734 (10elukey) [05:56:24] 10Analytics, 10Patch-For-Review: Fix the remaining bugs open on for Hue next - https://phabricator.wikimedia.org/T264896 (10elukey) [05:56:49] 10Analytics-Kanban: Deprecate Python 2 software from the Analytics infrastructure - https://phabricator.wikimedia.org/T204734 (10elukey) 05Open→03Resolved a:03elukey With T280262 I declare this task finally done! [05:57:39] 10Analytics, 10Patch-For-Review: Fix the remaining bugs open on for Hue - https://phabricator.wikimedia.org/T264896 (10elukey) [05:57:45] \o/ \o/ \o/ [07:06:15] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:09:48] 10Analytics-EventLogging, 10Analytics-Radar, 10Front-end-Standards-Group, 10MediaWiki-extensions-WikimediaEvents, and 4 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (10awight) @Krinkle wrote, > Broke searchSatisfaction (T280294).... [07:14:34] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Improve logging for HDFS Namenodes - https://phabricator.wikimedia.org/T265126 (10elukey) Procedure for an-master1002: ` add downtime for the host puppet disable merge https://gerrit.wikimedia.org/r/680179 systemctl stop hadoop-hdfs-namenode sys... 
[07:40:09] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:47:02] 10Analytics: Move lexnasser's files before user deletion - https://phabricator.wikimedia.org/T280096 (10elukey) @fdans you have now these dirs in your home dir on stat1007: ` elukey@stat1007:/home/fdans$ ls -l lexnasser-* lexnasser-aqs-test1001: total 1560 drwxr-xr-x 2 fdans wikidev 4096 Apr 16 07:34 cassand... [07:53:51] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:54:01] 10Analytics, 10Patch-For-Review: Decommission analytics-tool1001 and all the CDH leftovers - https://phabricator.wikimedia.org/T280262 (10elukey) ` root@apt1001:/srv/wikimedia# reprepro --delete clearvanished Deleting vanished identifier 'buster-wikimedia|component/cloudera|amd64'. Deleting vanished identifier... 
[07:54:44] !log drop all the cloudera packages from our repositories [07:54:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:54:50] * elukey dances [08:05:09] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:31:28] I am going to stop the hadoop daemons on an-master1002 to reshape its partitions [08:44:15] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:47:20] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Improve logging for HDFS Namenodes - https://phabricator.wikimedia.org/T265126 (10elukey) an-master1002 done: ` elukey@an-master1002:~$ sudo lsblk -i NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sda... [08:55:39] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:58:02] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Improve logging for HDFS Namenodes - https://phabricator.wikimedia.org/T265126 (10elukey) Procedure for an-master1001: ` disable puppet merge https://gerrit.wikimedia.org/r/680259 failover hdfs and yarn to an-master1002 systemctl stop hadoop-hd... [09:05:33] awight: what RU jobs do you want to check? [09:07:05] elukey: Thanks. This was merged a minute ago, https://gerrit.wikimedia.org/r/c/679390 and I'd like to be sure I'm not deleting data while jobs are running. [09:07:52] so: codemirror templatedata templatewizard visualeditor [09:09:10] While we're chatting, I'm also interested in starting to read about how to migrate my team from Graphite to , I guess Prometheus. 
And whether I can use tags there to express multidimensional data. [09:09:33] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:09:36] forced puppet on an-launcher1002, all timers gone [09:10:34] awight: for Prometheus I think Filippo is the best poc (he is part of the observability team etc..) [09:10:46] I can try to answer but I am not super expert :) [09:11:05] with tags you mean prometheus labels right? [09:13:14] elukey: oh heck yes. Thanks for the jargon alignment, that's exactly what I was hoping for. I'll ask Filippo about how and whether to migrate. [09:15:46] awight: the other thing is what to migrate to prometheus.. the main issue is that you'd move from a push metrics based approach to a pull metrics one [09:15:57] what we did in analytics was to add prometheus exporters [09:16:13] that collect metrics (or get metrics pushed to them) and allow the prometheus master nodes to pull them [09:16:22] but for spark jobs etc.. we haven't found a solution yet [09:16:30] there is the prometheus push gateway that may help [09:16:41] but I am not sure about its status [09:27:17] elukey: ah wow that's heavier than I expected. All of our jobs so far are reportupdater-queries, which don't seem to have a prometheus integration yet. I can wait, then. [09:31:34] awight: yes then it may be a problem, maybe with Airflow the integration can be better (even if I suspect the answer is not so much) [09:34:07] 10Analytics-Radar, 10observability, 10Graphite, 10Patch-For-Review, and 2 others: Broken reportupdater queries: edit count bucket label contains illegal characters - https://phabricator.wikimedia.org/T279046 (10awight) [09:34:26] elukey: We don't have any preference where these are run BTW, our only requirement is (currently) that the resulting graphs can be accessed publicly. 
[09:34:41] They're just hive and mariadb queries. [09:35:13] yes yes I can understand, it is something that we'll need to think about [09:35:20] graphite doesn't have a clear deadline yet [09:35:27] so we can rely on it for a bit longer [09:40:24] (03PS1) 10Awight: Update job start dates to only backfill existing data [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/680267 (https://phabricator.wikimedia.org/T279046) [09:42:20] 10Analytics-Radar, 10observability, 10Graphite, 10Patch-For-Review, and 2 others: Broken reportupdater queries: edit count bucket label contains illegal characters - https://phabricator.wikimedia.org/T279046 (10awight) [09:42:56] (03CR) 10Awight: "I missed the SQL job. Will fix that in a follow-up." [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/676297 (https://phabricator.wikimedia.org/T279046) (owner: 10Awight) [09:44:37] (03PS1) 10Awight: Escape edit count bucket for Graphite (sql query) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/680270 (https://phabricator.wikimedia.org/T279046) [09:45:14] * elukey errand, bbiab [09:47:05] 10Analytics-Radar, 10observability, 10Graphite, 10Patch-For-Review, and 2 others: Broken reportupdater queries: edit count bucket label contains illegal characters - https://phabricator.wikimedia.org/T279046 (10awight) [09:48:41] 10Analytics-Radar, 10observability, 10Graphite, 10Patch-For-Review, and 2 others: Broken reportupdater queries: edit count bucket label contains illegal characters - https://phabricator.wikimedia.org/T279046 (10awight) @mforns I think this is ready to go now, I've purged Graphite and prepared patches for t... 
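The "prometheus push gateway" option mentioned in the exchange above could, for push-style batch jobs like these reportupdater queries, look roughly like the sketch below. The metric name, label, and gateway URL are all invented for illustration, not taken from the log.

```shell
# Hypothetical sketch of pushing a batch-job metric to a Prometheus
# Pushgateway. Metric name, labels, and gateway URL are placeholders.
payload='# TYPE report_rows_processed gauge
report_rows_processed{query="codemirror"} 1234'
echo "$payload"
# A real job would pipe this to the gateway, e.g.:
#   printf '%s\n' "$payload" | curl --data-binary @- \
#       http://pushgateway.example:9091/metrics/job/reportupdater
```

Prometheus would then scrape the gateway like any other target, so the server side keeps its pull model while the job itself stays push-based.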
[10:01:19] elukey, razzi fyi no hosts in hadoop-ui (i guess due to the hue migration) [10:10:04] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:36:29] jbond42: hi! I should have removed it in theory [10:36:46] elukey: thanks <3 [10:37:38] https://gerrit.wikimedia.org/r/c/operations/puppet/+/679892/3/modules/profile/templates/cumin/aliases.yaml.erb [10:37:47] it is strange though that the error is still there [10:38:03] ahhh hadoop-ui: A:hadoop-hue-cdh or A:hadoop-hue or A:hadoop-yarn [10:38:05] yes my bad [10:38:12] I'll file a change after lunch jbond42 ! [10:38:17] thanks for the ping :) [10:38:22] ack thanks and no hurry [10:38:26] * elukey lunch! [11:14:16] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:25:28] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:30:13] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Escape edit count bucket for Graphite (sql query) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/680270 (https://phabricator.wikimedia.org/T279046) (owner: 10Awight) [11:50:14] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:55:28] Oh, right. Lunch. Knew I forgot something. bbiab! 
[12:01:28] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:25:32] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:27:18] RECOVERY - Check unit status of monitor_refine_mediawiki_job_events on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_mediawiki_job_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:36:54] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:41:02] PROBLEM - Check unit status of monitor_refine_mediawiki_job_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_mediawiki_job_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:50:40] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:02:02] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:04:10] fdans: llegando! (arriving!) [14:04:14] ok! 
[14:15:06] RECOVERY - Check unit status of monitor_refine_mediawiki_job_events on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_mediawiki_job_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:17:52] so while I was investigating the alarms, I noticed [14:17:52] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&orgId=1&var-server=an-coord1001&var-datasource=eqiad%20prometheus%2Fops&refresh=5m&from=now-7d&to=now [14:18:07] for some reason mysql is adding a ton more things since yesterday [14:18:31] sorry since two days ago, the 14th [14:20:10] at this rate we might saturate the partition during the weekend [14:23:41] hm, the partitions that Andrew was saying are being updated (from CamelCase to sadcamelcase) would just be updates in place, right, not additional data in the metastore... [14:24:27] milimetric: this is a good point [14:24:48] the metastore is way bigger than the rest [14:25:04] 1.1G analytics-meta-bin.017982 [14:25:04] 1.1G analytics-meta-bin.017983 [14:25:04] 1.1G analytics-meta-bin.017984 [14:25:04] 4.1G druid [14:25:04] 5.9G oozie [14:25:06] 6.6G hive_metastore [14:26:12] it seems to be growing as we speak, so monitoring du -sh on there should tell us what's growing, then drilling down more we could find the exact table that's getting data and "tail" it [14:26:15] PROBLEM - Check unit status of monitor_refine_mediawiki_job_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_mediawiki_job_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:26:37] I don't have rights to do it, but happy to pair and go splunking in the batcave [14:27:01] tail the binlog ? 
[14:27:33] that would work too but be kind of noisy, I put "tail" in quotes 'cause I was just thinking most tables probably have some sane schema with a timestamp [14:28:28] opening a task in the meantime [14:33:02] 10Analytics: Mysql partition on an-coord1001 sudden change in growth rate since Apr 14th - https://phabricator.wikimedia.org/T280367 (10elukey) p:05Triage→03High [14:37:55] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:49:01] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:56:48] 10Analytics: Mysql partition on an-coord1001 sudden change in growth rate since Apr 14th - https://phabricator.wikimedia.org/T280367 (10Ottomata) Hm, could this be related to {T280293}? It turns out that the Re-refine of data since the 14th that I did yesterday would have duplicated all the partition directorie... [14:59:37] o/ commented on task elukey milimetric [14:59:44] it would make sense that the binlog grew a bunch yesterday [15:00:00] but it should be done growing for that reason [15:00:16] looking into the sanitize test failures [15:00:53] wouldn't it keep growing as it's backfilling all the data? Or is it done backfilling? [15:01:16] 10Analytics: Mysql partition on an-coord1001 sudden change in growth rate since Apr 14th - https://phabricator.wikimedia.org/T280367 (10elukey) If it is the binlog we can definitely drop some of it (`PURGE BINARY LOGS BEFORE etc..`), we keep 14 days IIRC and everything should already be replicated to db1108 and... [15:01:55] it's done backfilling [15:02:12] well, job is having trouble, but it shouldn't have to do any backfilling there, since the tables are all lowercased. 
[15:02:47] i think the monitor job event stuff is just alerting because refine only looks back 28 hours, and refine monitor looks back 48, and there is missing data for the in-between hours [15:02:49] not 100% sure on that [15:02:59] but, i'm inclined to just stop refining job events. [15:03:11] we've never used them in a billion years since i started refining them [15:03:21] and they just cause us maintenance overhead [15:03:27] ottomata: I did some cleanup in the monitor failed flags, there were some leftovers from old timers (some changed name?) but now it should be good [15:04:25] 10Analytics: Mysql partition on an-coord1001 sudden change in growth rate since Apr 14th - https://phabricator.wikimedia.org/T280367 (10elukey) ` elukey@an-coord1001:~$ sudo ls -lht /var/lib/mysql/ | grep '\-bin' -rw-rw---- 1 mysql mysql 811M Apr 16 15:02 analytics-meta-bin.017985 -rw-rw---- 1 mysql mysql 1.1G A... [15:04:30] ottomata: --^ it is the binlog [15:05:06] elukey: aye, and mostly just grew on april 15, right? [15:05:11] yeah [15:05:29] things should change when we'll merge an-coord1001's partitions into /srv [15:05:31] elukey: thank you for the cleanup, which timers changed names? [15:06:06] ottomata: ah I don't recall now, it was one or two, but there were also old files mentioning refinery-job 0.0.145 under /usr/local/bin [15:06:37] ohhh maybe the failureschecker? [15:07:03] yes yes I think so [15:07:59] ahh sorry [15:08:20] so we could do something like PURGE BINARY LOGS BEFORE '2021-04-09 00:00:00'; [15:08:46] and then I can check over the weekend if more is needed [15:09:02] in theory this should fade away when the backfilling is done right? 
[15:09:08] by itself I mean [15:09:48] elukey: i think it would just fade away whenever binlogs are rotated away [15:09:52] backfilling is done [15:10:06] re test sanitize failures [15:10:27] i think it is failing because i just added it, and the delayed job is trying to refine some old data...but i'm not totally sure [15:10:40] anyway, i think we shouldn't worry about it, the immediate sanitize succeeded [15:10:45] succeeding with recent data [15:10:51] hmmm [15:11:00] but it will continue to fail as it runs i guess.. [15:11:00] ghm [15:11:36] ottomata: ok so what do you think about the purge above? [15:11:43] just to get some space [15:12:11] elukey: +1 yeah sounds good [15:12:27] as long as replicas are caught up to the end, we don't need those old logs ya/ [15:12:47] ok executing [15:13:33] !log execute PURGE BINARY LOGS BEFORE '2021-04-09 00:00:00'; [15:13:36] uff [15:13:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:14:10] !log execute PURGE BINARY LOGS BEFORE '2021-04-09 00:00:00'; on an-coord1001 to free space for /var/lib/mysql - T280367 [15:14:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:14:13] T280367: Mysql partition on an-coord1001 sudden change in growth rate since Apr 14th - https://phabricator.wikimedia.org/T280367 [15:14:25] /dev/mapper/an--coord1001--vg-mysql 59G 42G 18G 71% /var/lib/mysql [15:14:28] better :) [15:15:19] nice ;0 [15:15:37] haha i think my shift key is not working so well, that was supposed to be a :) but I guess ;0 is cool [15:16:28] 10Analytics: Mysql partition on an-coord1001 sudden change in growth rate since Apr 14th - https://phabricator.wikimedia.org/T280367 (10elukey) ` /dev/mapper/an--coord1001--vg-mysql 59G 42G 18G 71% /var/lib/mysql ` We'll have to keep this monitored during the weekend, but we should be good. To be noted:... 
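For reference, the purge performed above can be scripted as a small sketch. The cutoff is the one actually used in the log; the final mysql/df step is only shown as a comment, and the rule that replicas such as db1108 must have consumed the binlogs first comes straight from the conversation.

```shell
# Sketch of the binlog purge above: build the statement, then (on
# an-coord1001 only, and only once replicas such as db1108 have caught
# up) feed it to mysql. The cutoff date is the one from the log.
cutoff="2021-04-09 00:00:00"
sql="PURGE BINARY LOGS BEFORE '${cutoff}';"
echo "$sql"
# then: sudo mysql -e "$sql" && df -h /var/lib/mysql
```

Purged binlogs are gone for point-in-time recovery, which is why the replica-caught-up check matters before running this.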
[15:17:44] ottomata: an-master1002 is now working with only one /srv partition, I'll do the same to an-master1001 but probably not now, too many things in progress [15:17:58] then I'll help Razzi to reimage them to buster with the reuse partitions [15:18:32] an-coord1001 should be done next week (need to schedule the maintenance window) and then we'll have only flerovium/furud to do [15:18:49] so we are close to finish :) [15:21:20] ok great! awesome! [15:23:05] lemme know what I can do for the refine stuff, with all the emails I am a little confused now :D [15:23:15] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:32:24] so annoying ^ [15:32:49] is it a 503 from the API? [15:33:00] I saw it earlier on from the logs [15:33:05] yeah [15:33:05] Apr 16 15:17:05 an-launcher1002 produce_canary_events[29162]: POST https://eventgate-logging-external.svc.eqiad.wmnet:4392/v1/events => BasicHttpResult(failure) status: 503 message: Service Unavailable. Response body: [15:33:14] huh the response body is empty, strange [15:34:31] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:35:07] hm [15:35:07] https://logstash.wikimedia.org/goto/ef535fd7216f9656f15bbb145d33a506 [15:35:13] time="2021-04-16T15:31:20Z" level=warning msg="backtracking required because of match \"*.heap.*\", matching performance may be degraded" source="fsm.go:313" [15:35:13] [15:38:16] oh nm that is staging logs [15:48:27] 10Analytics, 10Patch-For-Review: Decommission analytics-tool1001 and all the CDH leftovers - https://phabricator.wikimedia.org/T280262 (10elukey) Last step is to decide if we want to keep `hue_next` as database name for hue, or to rename it. 
The current database `hue` should be dropped from all db hosts and al... [15:52:49] elukey: actually [15:52:51] i think this is why [15:52:51] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/680375 [15:52:55] https://phabricator.wikimedia.org/T279342 [15:53:31] this is the task that I was working on for the network failures in our hosts the other day :) [15:53:38] rsyslog was failing to contact the new brokers [15:53:40] sigh [15:53:44] makes sense yes [15:55:58] 10Analytics, 10Event-Platform, 10Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10Ottomata) p:05Low→03Medium [15:57:36] 10Analytics, 10Event-Platform, 10SRE, 10serviceops, 10Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10Ottomata) [15:57:46] elukey: just updated https://phabricator.wikimedia.org/T253058 [15:57:50] if we had that this wouldn't happen [15:59:17] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:10:33] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:11:39] (03PS1) 10Ottomata: SanitizeTransformation - Just some simple logging improvements. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/680382 [16:20:58] Hi elukey, question: I saw you create the task for the an-worker1100 disk, and the an-master1001 alert, which you said was related, is gone now; could you explain how? The an-master alert was "Yarn Nodemanagers in unhealthy status" [16:25:01] razzi: hey! 
So I thought it was related to the yarn nodemanager failing due to the missing partition on an-worker1100, but then I discovered that it was another node, I had to restart the yarn node manager in there for a weird issue [16:25:35] there should be a runbook for it attached to the alert, but basically I went to yarn.wikimedia.org and checked in the main page the "unhealthy nodes" [16:25:43] then I went to the failing node and checked logs [16:26:18] razzi: also for an-worker1100 I did more than create a task, I unmounted the failing partition and restarted the node manager [16:26:27] I left some comments in the task [16:29:27] the partition on an-worker1100 went read-only, and the yarn nodemanager was trying to write on it [16:29:39] puppet was failing to enforce the dir etc.. [16:29:47] gotcha [16:30:00] simply doing umount + fstab comment + run puppet + restart yarn nm does the trick [16:30:20] the reverse should be done after Chris swaps the disk [16:30:38] So now hdfs is aware there is no longer that disk, and will copy the under-replicated data elsewhere? [16:30:53] correct [16:32:19] the datanode is more resilient to failures, we configured it to tolerate 2 disks failed [16:32:32] more is probably something weird that needs investigation [16:37:47] 10Analytics, 10Event-Platform, 10SRE, 10serviceops, 10Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10Ottomata) Actually, I'm not sure even just doing LVS would help here. The helmfiles networkpolicy explicitly lists IP addresses that the servic... [17:01:32] folks going afk for the weekend! [17:01:42] talk with you on monday o/ [17:02:18] ottomata: I'll keep checking over the weekend /var/lib/mysql on an-coord1001 just in case, and purge more binlogs if needed [17:02:21] :) [17:03:04] if there are re-runs etc.. to do please send me a ping and I'll do them as well! [17:03:16] ok cool! [17:03:19] all should be good now [17:03:22] have a good weekend! 
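The "umount + fstab comment + run puppet + restart yarn nm" procedure elukey walks razzi through could be outlined as below. This is a dry-run sketch only: every step is echoed rather than executed, and the mount point is hypothetical.

```shell
# Dry-run outline of the read-only-disk procedure described above.
# The mount point is hypothetical; nothing here touches the system.
handle_readonly_disk() {
    mnt="$1"
    echo "umount ${mnt}"
    echo "comment the ${mnt} entry out of /etc/fstab"
    echo "run-puppet-agent"
    echo "systemctl restart hadoop-yarn-nodemanager"
    echo "(reverse these steps after the disk is swapped)"
}
handle_readonly_disk /var/lib/hadoop/data/m
```

As noted in the chat, the datanode keeps running through this because it is configured to tolerate up to 2 failed disks, and HDFS re-replicates the now-missing blocks elsewhere.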
[17:36:56] 10Analytics, 10Data-release, 10Privacy Engineering, 10Research, 10Privacy: Apache Bean go prototype code for DP evaluation - https://phabricator.wikimedia.org/T280385 (10Nuria) [18:03:56] 10Analytics, 10Data-release, 10Privacy Engineering, 10Research, 10Privacy: Apache Bean go prototype code for DP evaluation - https://phabricator.wikimedia.org/T280385 (10Htriedman) Hi all — I'm Hal Triedman, the new Privacy Engineering intern. Over the last few days, I've been working on re-implementing... [18:22:45] 10Analytics, 10Data-release, 10Privacy Engineering, 10Research, 10Privacy: Apache Beam go prototype code for DP evaluation - https://phabricator.wikimedia.org/T280385 (10Ottomata) [19:35:07] (03PS1) 10Razzi: Add eowikivoyage and trwikivoyage to sqoop list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/680414 (https://phabricator.wikimedia.org/T279564) [19:43:05] heyyy quick question... are there recommended details about how to create a table in my own Hive database with the results of a HQL query? Just tried the following query, but it errored out... create table andyrussg.it_bh_fr_2021 as select * from wmf.webrequest where uri_query LIKE '%CentralNoticeBannerHistory%' and geocoded_data['country_code']='IT' and year=2021 and month>=3 and month<=4 and [19:43:07] day=25 and webrequest_source='text'; [19:43:31] org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask [19:43:42] and java.sql.SQLException: org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTas [19:43:43] k [19:43:48] after about 5 min of execution [19:44:00] (running directly on beeline in console) [19:44:41] (ahh oops also I see a mistake in the query, the day condition shouldn't be there) [19:45:07] still should have worked? 
[20:04:54] Hi AndyRussG, I was able to create a table using your query with an extra "... and hour = 1" [20:04:54] What's the error you're seeing? [20:07:22] razzi: hi, thanks! the errors mentioned above, not very informative [20:08:09] maybe some sort of permissions thing? I haven't used my user database in a while [20:08:31] was just going to see if I could create an empty table.. and also try on Hive instead of Beeline [20:09:00] Oh yeah good point, I created my table in hive [20:09:21] K just gonna try there... [20:16:42] razzi: failed on Hive too, also added the hour=1 conditional... full output: https://paste.toolforge.org/view/b4583b54 [20:18:40] this worked fine, though: create table andyrussg.test1 like wmf.webrequest; [20:22:07] Thanks for the paste, let me take a look [20:23:48] razzi: thanks so much!!! ahh btw in about 5 minutes I have to run do a quick errand, but I'd be back in about 1 hr :) [20:25:04] Sounds good AndyRussG. Couple quick questions: where are you running hive, and did you kinit? [20:25:13] My query worked on stat1008.eqiad.wmnet [20:26:30] razzi: on stat1005. Yep, did kinit... other test queries that don't create tables came back fine, except for one, after the failed ctas, actually, that hung and that I had to kill manually in another terminal [20:30:11] AndyRussG: I ran the same query as you and got the same error [20:32:45] Here's mine that worked: `create table razzi.banner_history as select * from wmf.webrequest where uri_query LIKE '%CentralNoticeBannerHistory%' and geocoded_data['country_code']='IT' and year=2021 and month = 3 and day=25 and hour = 1 and webrequest_source='text' limit 10;` [20:32:53] razzi: huh weird! [20:33:04] something about the >= and <= maybe? 
[20:33:11] just trying select * from wmf.webrequest where uri_query LIKE '%CentralNoticeBannerHistory%' and geocoded_data['country_code']='IT' and year=2021 and month>=3 and month<=4 and day=5 and hour=1 and webrequest_source='text' limit 10; [20:33:40] or maybe multiple conditions on a partition field? [20:33:50] ah yes I changed the month to be =, and I added a limit [20:39:22] yeah that didn't work ^ [20:39:31] (the one I pasted) [20:40:09] Exception in thread "main" java.lang.OutOfMemoryError: Java heap space [20:46:03] 10Analytics: Move lexnasser's files before user deletion - https://phabricator.wikimedia.org/T280096 (10lexnasser) @elukey Oops, I totally missed checking HDFS and Hive. From HDFS, I'd like to preserve: 1. `/user/lexnasser/etwiki_sqoop` 2. `/user/lexnasser/labsdb-analytics-mysql-sqoop-tool-pw.txt` 3. `/user/lex... [20:46:32] razzi: yeah your query ^ also worked for me, so must be something with the month bit... gotta run, thanks so much!! :D [20:46:47] sure thing, thanks for stopping by AndyRussG ! [20:46:55] :) [20:53:01] 10Analytics, 10Patch-For-Review: Add time interval limits to pageview API - https://phabricator.wikimedia.org/T261681 (10lexnasser) a:05lexnasser→03None [21:03:47] 10Analytics-EventLogging, 10Analytics-Radar, 10Front-end-Standards-Group, 10MediaWiki-extensions-WikimediaEvents, and 4 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (10Jdrewniak) I'm noticing a large number of errors related to th... [21:59:47] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10razzi) @Marostegui do you have any advice on how to configure clouddb1021 memory / memory alerts? Would it be worth doing to... [22:14:25] razzi: hi again! 
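One plausible reading of the failure above, not confirmed anywhere in the log: the `month>=3 and month<=4` range on a partition column makes Hive enumerate far more webrequest partitions on the client side than a pinned value does, so the client dies with the heap error before the job runs. The sketch below only builds the query string for the variant with partition columns pinned (explicit OR instead of a range), using the table names from the conversation.

```shell
# Builds the rewritten CTAS as a string: partition columns pinned with an
# explicit OR instead of a range. Table and database names are from the log.
query="CREATE TABLE andyrussg.it_bh_fr_2021 AS
SELECT * FROM wmf.webrequest
WHERE uri_query LIKE '%CentralNoticeBannerHistory%'
  AND geocoded_data['country_code'] = 'IT'
  AND year = 2021 AND (month = 3 OR month = 4)
  AND webrequest_source = 'text'"
echo "$query"
# then e.g.: beeline -e "$query"
```

Whether the heap-size theory is right or not, pinning each partition column (or a small explicit OR) keeps the partition scan bounded either way.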
now getting Out of Memory / Java heap space errors with queries that seemed to be working before, on both stat1005 and stat1008 [22:14:37] https://paste.toolforge.org/view/0cfa7208 [22:14:39] huh, strange [22:14:45] yeah eh! [22:15:05] just tried in both places with the query of yours that worked (without the create table part) [22:15:48] should I be running this somewhere else or something? maybe spark in a new Jupyter notebook? [22:16:37] (I went to the cli just for some initial queries to set up and check this temporary table, but was planning to mostly work from a Jupyter notebook... I can also try there...) [22:16:56] thx again! [22:17:40] AndyRussG: Spark via jupyter notebook is worth a try [22:21:17] razzi: okok I'll try as per https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#PySpark [22:24:36] AndyRussG: cool, let me know how that goes [22:30:17] razzi: K the base query (like above, no create table ^) worked in Jupyter spark... [22:31:43] trying now with (month=3 or month=4) [22:35:18] K that also worked.... [22:35:46] 10Analytics-Radar, 10User-bd808: Reduce partition granularity of hive tables - https://phabricator.wikimedia.org/T273310 (10bd808) I'm just going to drop all these tables. The data loader job stopped over 6 months ago, and nobody actually seems to care about the Action API as an active area of development or i... [22:38:35] AndyRussG: alright! 
go Spark [22:51:52] 10Analytics-Radar, 10User-bd808: Reduce partition granularity of hive tables - https://phabricator.wikimedia.org/T273310 (10bd808) 05Open→03Resolved ` hive (bd808)> show tables; OK tab_name Time taken: 0.041 seconds ` [22:56:27] razzi: welp a test of a bigger query, without the hour or day condition, but with the limit 10, took a long time, then my internet died, then the jupyter notebook didn't seem to be able to reconnect properly, kept saying "kernel unknown" [22:56:42] so I killed that query using yarn application --kill [22:56:52] and am just gonna try directly the whole table create in jupyter spark [22:59:43] sounds good AndyRussG, thanks for cleaning up the query [22:59:43] Theoretically jupyter should be fine to access with intermittent connectivity, but you could also try the pyspark2 cli under tmux / screen [23:01:14] razzi: okok cool good to know! ahh yeah might have been fine, just didn't really know what to make of the "kernel busy" message (mouseover the kernel icon in the upper-right-hand corner) [23:01:46] ** ^ sorry, not "kernel busy", I mean "kernel unknown" [23:02:33] now it's running the table create fine with the "kernel busy" state, which I think is expected, I'll let you know how it turns out!
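razzi's closing suggestion (run the pyspark2 cli under tmux so a long query survives a dropped connection) might look like the dry-run sketch below; the session name is arbitrary, the yarn application id is a placeholder (the real one isn't recorded in the log), and the commands are echoed rather than executed.

```shell
# Sketch of the "pyspark2 under tmux" workflow suggested above, plus the
# yarn kill used earlier for a runaway query. Session name and application
# id are placeholders; every command is echoed, not run.
session="ctas"
app_id="application_0000000000000_00001"
echo "tmux new-session -d -s ${session}"
echo "tmux send-keys -t ${session} pyspark2 Enter"
echo "yarn application -kill ${app_id}   # only if the job runs away"
# reattach later with: tmux attach -t ${session}
```

If the laptop's connection drops, the tmux session (and the Spark driver inside it) keeps running on the stat host, which is exactly the failure mode the Jupyter kernel hit above.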