[07:11:40] Data-Engineering: Check home/HDFS leftovers of dsharpe - https://phabricator.wikimedia.org/T310463 (MoritzMuehlenhoff)
[07:18:59] !log Manually rerun webrequest_text load for hour 2022-06-12T08:00
[07:19:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:55:00] Data-Engineering, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Implement wikistats 2 endpoints - https://phabricator.wikimedia.org/T288301 (JAllemandou) Hi @BPirkle - I can't help with the entangled roots unfortunately - the poor warrior I am would not deal with any magic by any means :) As fo...
[08:27:37] Data-Engineering: Check home/HDFS leftovers of dsharpe - https://phabricator.wikimedia.org/T310463 (Peachey88)
[09:51:38] !log Manually rerun webrequest_text load for hour 2022-06-13T03:00
[09:51:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:54:12] !log Rerun failed refine for mediawiki_talk_page_edit events
[09:54:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:54:24] !log rerun failed refine for network_flows_internal
[09:54:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:56:26] RECOVERY - Check unit status of monitor_refine_event on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:36:15] Data-Engineering, Event-Platform, Observability-Alerting, Patch-For-Review: Apparent latency warning in 90th centile of eventgate-logging-external - https://phabricator.wikimedia.org/T294911 (phuedx) >>! In T294911#7991260, @BTullis wrote: > Now the only endpoint which appears to take longer than...
[12:22:17] Data-Engineering, Event-Platform, Observability-Alerting, Patch-For-Review: Apparent latency warning in 90th centile of eventgate-logging-external - https://phabricator.wikimedia.org/T294911 (BTullis) Thanks @phuedx - I think you're correct. According to this information we shouldn't even have th...
[12:25:30] Data-Engineering, Event-Platform, Observability-Alerting, Patch-For-Review: Apparent latency warning in 90th centile of eventgate-logging-external - https://phabricator.wikimedia.org/T294911 (Ottomata) Oh! it's just k8s-staging! yes we do not need eventgate deployed to k8s-staging. Nice.
[12:26:18] !log restarting hive-server2 and hive-metastore on an-coord1002
[12:26:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:53:26] Heya - mforns, aqu: I plan on doing some airflow code merging - any issue with that?
[12:58:04] Hi joal, no problem with me.
[12:58:45] ack aqu - Will start with the test-fixtures, then proceed with code adaptation on spark3 patches one by one
[12:59:08] Here we go - banzai!
[12:59:09] !log having failed over hive to an-coord1002 10 minutes ago, I'm restarting hive services on an-coord1001
[12:59:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:04:15] aqu: I could do with a quick brain bounce - would you have a minute?
[13:04:27] sure
[13:04:35] aqu: batcave!
[13:09:34] !log restarting oozie service on an-coord1001
[13:09:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:20:14] !log btullis@datahubsearch1001:~$ sudo systemctl reset-failed ifup@ens13.service T273026
[13:20:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:20:16] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026
[13:33:21] heya joal aqu just joined, can I help you with the deployments?
[13:35:49] Hi mforns - I'm merging for the moment
[13:35:57] mforns: I plan on deploying after standup if ok for you
[13:37:49] joal: of course
[13:58:24] Data-Engineering, Data-Engineering-Kanban, Event-Platform, Generated Data Platform, Patch-For-Review: Add better support for using Event Platform streams with the Flink DataStream API - https://phabricator.wikimedia.org/T310302 (Ottomata)
[13:58:26] Data-Engineering, Data-Engineering-Kanban, Event-Platform, Generated Data Platform, Patch-For-Review: [Shared Event Platform] Ability to use Event Platform streams in Flink without boilerplate - https://phabricator.wikimedia.org/T308356 (Ottomata)
[14:00:24] !log restarting presto service on an-coord1001
[14:00:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:03:12] Getting kids from school - back at standup
[14:06:24] joal: in the reply about AQS -> Druid above, did you mean to link to https://github.com/wikimedia/analytics-aqs/blob/master/lib/druidUtil.js? It looks like you copy/pasted the fake-druid testing harness when you were talking about the query-building DSL you created
[14:06:58] (no rush)
[14:24:44] Data-Engineering, Data-Engineering-Kanban, Event-Platform, Generated Data Platform: Add Event Platform timestamp JSONSchema -> Flink type support - https://phabricator.wikimedia.org/T310495 (Ottomata)
[14:25:45] PROBLEM - AQS root url on aqs2003 is CRITICAL: connect to address 10.192.0.211 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[14:25:46] PROBLEM - AQS root url on aqs2010 is CRITICAL: connect to address 10.192.48.187 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[14:44:09] PROBLEM - AQS root url on aqs2005 is CRITICAL: connect to address 10.192.16.42 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[14:46:43] PROBLEM - AQS root url on aqs2007 is CRITICAL: connect to address 10.192.16.169 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[14:49:21] PROBLEM - AQS root url on aqs2001 is CRITICAL: connect to address 10.192.0.111 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[14:49:23] PROBLEM - AQS root url on aqs2004 is CRITICAL: connect to address 10.192.0.212 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[14:55:14] Data-Engineering-Radar, Cassandra, Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (Eevans) These were installed with bullseye (the default), and we have thus far only run Cassandra on <= buster. We are missing the cassandradev component for...
[14:55:26] PROBLEM - AQS root url on aqs2012 is CRITICAL: connect to address 10.192.48.189 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[14:58:22] I assume the AQS alerts are from new hosts in codfw being added to the cluster - btullis can you confirm we can skip?
[15:03:16] PROBLEM - AQS root url on aqs2011 is CRITICAL: connect to address 10.192.48.188 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:10:50] Data-Engineering-Radar, Cassandra, Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (Eevans) >>! In T307801#7998960, @Eevans wrote: > These were installed with bullseye (the default), and we have thus far only run Cassandra on <= buster. We a...
[15:30:53] joal: got 5 mins for a java aesthetics q?
[15:31:01] sure
[15:31:06] https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/804614/2..3/eventutilities-flink/src/main/java/org/wikimedia/eventutilities/flink/formats/json/JsonSchemaConverterNew.java#b219
[15:31:13] still in bc if you wanna
[15:31:22] Data-Engineering-Radar, Cassandra, Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2001.codfw.wmnet with OS buster
[15:31:24] actually ottomata - it's gonna be more than 5mins - let's do it after the airflow meeting :)
[15:31:27] okay
[15:32:34] Data-Engineering-Radar, Cassandra, Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (MatthewVernon) Trying a reimage of aqs2001 with buster.
[15:44:57] Data-Engineering, Airflow: [Airflow] Add DAG subfolder name to error email's subject - https://phabricator.wikimedia.org/T300054 (mforns)
[16:19:34] Data-Engineering-Radar, Cassandra, Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2001.codfw.wmnet with OS buster completed: - aqs2001 (**WARN...
[16:28:30] Data-Engineering-Radar, Cassandra, Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2002.codfw.wmnet with OS buster
[16:32:04] Data-Engineering: Check home/HDFS leftovers of dsharpe - https://phabricator.wikimedia.org/T310463 (sbassett)
[16:34:01] joal: o/ quick bb?
[16:37:41] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster
[16:41:05] joal: yes, apologies for missing the ping. Yes we can ignore these. I think that it was caused by moritz.m rebooting these hosts but they are still being set up, so alerting wasn't expected.
[16:41:19] ack - thanks btullis :)
[16:42:30] joal: later better? gonna run an errand real quick if so?
[16:42:52] excuse me ottomata - I completely forgot - later is good no problem
[16:43:09] now okay? or later better for you?
[16:43:22] in meeting now actually, so later please
[16:43:25] okay!
[16:49:12] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster
[16:53:50] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster exec...
[16:54:02] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster
[16:59:51] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster exec...
[16:59:55] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1147.eqiad.wmnet with OS buster
[17:04:42] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1148.eqiad.wmnet with OS buster
[17:05:48] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster comp...
[17:18:57] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster
[17:22:56] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster
[17:24:06] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1147.eqiad.wmnet with OS buster comp...
[17:29:03] Data-Engineering-Radar, Cassandra, Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2002.codfw.wmnet with OS buster completed: - aqs2002 (**WARN...
[17:30:01] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster
[17:31:21] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1148.eqiad.wmnet with OS buster comp...
[17:42:26] joal: back!
[17:42:43] hi ottomata - your internal schedule is linked to mine it seems :)
[17:42:52] in a good way!?
[17:43:03] this time, yes!
[17:43:05] :)
[17:43:05] yeehaw
[17:43:06] bc
[17:43:10] OMW
[17:47:13] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster comp...
[17:49:55] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster comp...
[17:53:50] Data-Engineering-Radar, Cassandra, Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (Eevans)
[17:55:34] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster comp...
[18:27:14] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (Cmjohnson) Open→Resolved Finally resolved this, had some issues with network ports not being correct
[19:06:34] Hi mforns - I still have 3 CR pending, with some discussions - Would you have a minute to discuss them?
[19:09:53] yes
[19:10:02] joal: ^ bc?
[19:10:06] Yes!
[19:10:44] omw
[19:25:47] Data-Engineering, Airflow: [Airflow] Refactor HDFSArchiveOperator to run in Skein - https://phabricator.wikimedia.org/T310542 (mforns)
[19:39:32] (PS3) Joal: Update geoeditors HQL [analytics/refinery] - https://gerrit.wikimedia.org/r/804574
[21:35:32] Data-Engineering, Data-Engineering-Kanban, Cloud-Services, Developer-Advocacy: Data missing on the hierarchical view on the wmcs-edits tool - https://phabricator.wikimedia.org/T310317 (srishakatux) Open→Resolved Thanks a tonne @Milimetric <3
[21:57:11] Data-Engineering, Data-Engineering-Kanban, Generated Data Platform, Patch-For-Review: Flink output support for Event Platform events - https://phabricator.wikimedia.org/T310218 (Ottomata) Did a little more research on this today. I think we should write both - `EventJsonRowSerializationSchema i...