HC1: OP SIG 21 Oct 2024

Revision as of 19:10, 4 November 2024

Agenda

What: Meeting to Discuss Improving Node Operations as part of the HC1: OP SIG

When: 21 Oct 2024 @ 8 CET EST

Where: https://matrix.to/#/#sig-op:hc1.chat

Chair: 0x3639

Agenda:

Discuss follow-up items from previous meeting
Document action items
Establish next meeting

If you want to attend please respond (or DM) with your full matrix username and I will invite you to the group. No FUD, anger or BS allowed.

Pre-meeting Notes

0x3639

Added script to backup locally and to digital ocean
Added script to restore from local or digital ocean
Currently testing and debugging both
Fixed bugs
Reviewed Go code for metrics used for monitoring purposes

George

Created initial znnd_prom_exporter with currentHeight, targetHeight, and connectedPeers
Starting to create requirements for OP SIG from HyperQube SIG

Coinselor

The readme file is duplicated. The convention is to use all uppercase. We should delete readme.md
I've implemented minimal logic to address #10 -
- Should it also check for apt?
- Should we check if /etc/debian_version exists?
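
Both questions could be covered by one check. A minimal sketch in Python, assuming the script only needs to know whether the host is Debian-like with apt available; what issue #10 actually requires is not stated here, and the function name `is_debian_like` is hypothetical:

```python
import os
import shutil


def is_debian_like(version_file="/etc/debian_version",
                   which=shutil.which, exists=os.path.exists):
    """Return True when the marker file exists and an apt binary is on PATH.

    `which` and `exists` are injectable so the check can be tested
    without touching the real filesystem.
    """
    has_marker = exists(version_file)
    has_apt = which("apt-get") is not None or which("apt") is not None
    return has_marker and has_apt
```

Requiring both conditions is the stricter choice; checking only one of them would accept hosts where the package commands may later fail.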

Meeting Minutes

Mon, Oct 21, 2024, 13:00:48 - deeznnutz: === START OP SIG 21 OCT 2024 ===

Mon, Oct 21, 2024, 13:00:52 - deeznnutz: Hello!

Mon, Oct 21, 2024, 13:01:00 - georgezgeorgez: Hello

Mon, Oct 21, 2024, 13:01:00 - vilkris: Hello

Mon, Oct 21, 2024, 13:01:27 - georgezgeorgez: deeznnutz: did you have any notes/agenda to post first or do we just get into it?

Mon, Oct 21, 2024, 13:01:36 - deeznnutz: yes - just a few

Mon, Oct 21, 2024, 13:01:46 - deeznnutz: Since last meeting we've made progress in the areas we discussed.

In the community poll on priorities we got feedback that node deployment options were important along with real time monitoring. https://forum.hypercore.one/t/community-poll-priority-enhacements-for-deployment-script

After that feedback George wrote some code to enable grafana monitoring of a few node endpoints

- syncCurrentHeight
- syncTargetHeight
- networkConnectedPeers

If anyone is interested we did a ChatGPT summary of the code here: https://forum.hypercore.one/t/metrics-code-review-for-op-sig-meeting-by-chatgpt/531. I think next steps are to test this and then set up grafana to ingest that endpoint.
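
For reference, a Prometheus exporter serves plain-text lines that Grafana can chart once Prometheus scrapes them. A minimal sketch of that text format for the three values mentioned in this message; the metric names (`znnd_sync_current_height` and so on) are assumptions, not necessarily what George's exporter emits:

```python
def render_metrics(current_height, target_height, connected_peers):
    """Render three gauges in the Prometheus text exposition format."""
    gauges = [
        ("znnd_sync_current_height", "Height the node has synced to.", current_height),
        ("znnd_sync_target_height", "Height the node is syncing toward.", target_height),
        ("znnd_network_connected_peers", "Number of connected peers.", connected_peers),
    ]
    lines = []
    for name, help_text, value in gauges:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The format expects each line, including the last, to end with a newline.
    return "\n".join(lines) + "\n"
```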

Mon, Oct 21, 2024, 13:02:53 - georgezgeorgez: Yup, the next step for me there is to create the dashboards that will allow us to track and visualize a node syncing over time.

Mon, Oct 21, 2024, 13:03:07 - georgezgeorgez: That will also allow us to visualize the impact of vilkris 's performance work.

Mon, Oct 21, 2024, 13:03:35 - georgezgeorgez: We should start thinking about what that test/experiment should look like.

Mon, Oct 21, 2024, 13:03:43 - tapwoot joined the room

Mon, Oct 21, 2024, 13:03:56 - georgezgeorgez: Spin up two nodes at the same time, one with, one without performance improvements, and see how they compare in terms of sync graph?

Mon, Oct 21, 2024, 13:04:29 - georgezgeorgez: I should probably make a thread on the forum for us to start working on that.

Mon, Oct 21, 2024, 13:05:27 - deeznnutz: vilkris: do you expect the performance improvements to improve sync times or just processing new momentums?

Mon, Oct 21, 2024, 13:05:31 - coinselor: Would be nice to get this data in a structured format with as little user interaction as possible.

Mon, Oct 21, 2024, 13:05:48 - georgezgeorgez: <@coinselor:zenon.chat "Would be nice to get this data i..."> Do you mean as an automated test?

Mon, Oct 21, 2024, 13:06:24 - vilkris: We should keep in mind that the goal of the performance improvements wasn't to decrease sync time, but to tackle extremely poor performance in specific situations

Mon, Oct 21, 2024, 13:06:37 - vilkris: Decreased sync time might be a positive side effect

Mon, Oct 21, 2024, 13:06:57 - georgezgeorgez: Of course, we have the node exporter metrics to look at as well.

Mon, Oct 21, 2024, 13:07:19 - coinselor: I'm thinking when we have users test the script, and sync v1/v2, we at least get some output logs or something we can get from them to analyze this data later

Mon, Oct 21, 2024, 13:07:41 - georgezgeorgez: With regards to sync times, we know things get stuck at some problematic momentums.

That would look flat on a sync graph.

Mon, Oct 21, 2024, 13:08:09 - georgezgeorgez: And when we put those graphs next to CPU/RAM graphs, and look at where things go flat.

We can start to diagnose more problem momentums and their fixes.
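
Finding those flat stretches can be automated once the height samples are exported. A rough sketch, assuming samples are `(timestamp_seconds, height)` pairs sorted by time; the function name and the `min_duration` threshold are hypothetical:

```python
def find_plateaus(samples, min_duration):
    """Return (start_ts, end_ts, height) for each stretch where the height
    stayed unchanged for at least `min_duration` seconds."""
    plateaus = []
    if not samples:
        return plateaus
    start_ts, last_height = samples[0]
    prev_ts = start_ts
    for ts, height in samples[1:]:
        if height != last_height:
            # Height moved: close the previous stretch if it was long enough.
            if prev_ts - start_ts >= min_duration:
                plateaus.append((start_ts, prev_ts, last_height))
            start_ts, last_height = ts, height
        prev_ts = ts
    # Close the final stretch, which may still be flat at the end of the data.
    if prev_ts - start_ts >= min_duration:
        plateaus.append((start_ts, prev_ts, last_height))
    return plateaus
```

The returned heights point at candidate problem momentums, and the timestamps can be lined up against the CPU/RAM graphs.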

Mon, Oct 21, 2024, 13:08:18 - cryptofish: <@vilkris:hc1.chat "We should keep in mind that the ..."> Some of those specific situations can be easily created locally.

Mon, Oct 21, 2024, 13:08:20 - georgezgeorgez: But yes, good point vilkris

Mon, Oct 21, 2024, 13:08:26 - deeznnutz: makes sense... testing whether the improvements help sync past the trouble momentums quickly.

Mon, Oct 21, 2024, 13:09:10 - georgezgeorgez: As we mature, we'll have better tools for testing things in isolation.

Mon, Oct 21, 2024, 13:09:27 - georgezgeorgez: The sync graphs combined with the node cpu/ram data is an initial crude way to measure.

Mon, Oct 21, 2024, 13:10:13 - georgezgeorgez: And something easily explainable to the entire community.

Mon, Oct 21, 2024, 13:11:41 - vilkris: Yeah it's a starting point. I've been running a side by side test of syncing a v2 node and a baseline node. V2 is ahead but both are still very slow to sync

Mon, Oct 21, 2024, 13:11:50 - vilkris: This is on VPS

Mon, Oct 21, 2024, 13:12:36 - vilkris: I think this problem is something we might want to give higher priority

Mon, Oct 21, 2024, 13:13:15 - georgezgeorgez: <@cryptofish:hc1.chat "Some of those specific situation..."> When vilkris and I did the code review, we briefly discussed how we could test this locally. At least for the problem of retrieving old state and having to rollback many momentums, constructing that data in a local test case might be difficult. The current local testing is using snapshots, right? And we haven't documented how to create that in an easy way.

Mon, Oct 21, 2024, 13:14:18 - georgezgeorgez: I'm looking at performance testing tools. https://k6.io/ is fairly popular and something I'm thinking about how we can make useful.

Mon, Oct 21, 2024, 13:14:21 - vilkris: <@georgezgeorgez:hc1.chat "When vilkris and i did the code ..."> Yes using a snapshot is the easiest way to test a problematic momentum

Mon, Oct 21, 2024, 13:14:48 - deeznnutz: In our matrix chat last week or so we discussed some "backup" / "bootstrapping" options. I think we've learned from the Supernova deployment that having a local bootstrap of the network can be useful. I recommend that NoM / HyperQube operators have their own bootstrap so they don't need to rely on anyone else. I've been working on a backup / restore script that saves the chain data locally so users can back up their chain data and recover it locally. I'm also working on an option to save the data to an S3 endpoint.

Mon, Oct 21, 2024, 13:15:03 - cryptofish: <@georgezgeorgez:hc1.chat "When vilkris and i did the code ..."> Cannot remember exactly, but yeah we used an old snapshot of which we knew the height that caused the issue.

Mon, Oct 21, 2024, 13:15:28 - deeznnutz: In supernova operators have been sharing snapshots and for a small network I think it's a really bad idea

Mon, Oct 21, 2024, 13:16:00 - georgezgeorgez: In terms of trustlessness, yeah that's an attack surface.

Mon, Oct 21, 2024, 13:16:12 - vilkris: <@deeznnutz:zenon.chat "In our matrix chat last week or ..."> Are you copying the DB while the node is running?

Mon, Oct 21, 2024, 13:16:18 - georgezgeorgez: That's why I have been only suggesting personal bootstraps

Mon, Oct 21, 2024, 13:16:22 - georgezgeorgez: and not public ones

Mon, Oct 21, 2024, 13:16:36 - deeznnutz: The script simply stops `go-zenon`, copies the necessary files, starts `go-zenon` and then tars the files with a date and hash.

Mon, Oct 21, 2024, 13:16:46 - vilkris: Ok gotcha

Mon, Oct 21, 2024, 13:17:26 - coinselor: maybe would be nice to add height to the name, too?

Mon, Oct 21, 2024, 13:17:45 - deeznnutz: ya, good call
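
A sketch of the archive-naming step only, with the height included as suggested. It assumes the node is already stopped and that `height` is supplied by the caller; the `znn-` prefix, the name layout, and the use of SHA-256 are assumptions, not details of deeznnutz's script:

```python
import hashlib
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def archive_chain_data(data_dir, out_dir, height, now=None):
    """Tar `data_dir` into `out_dir`, naming the archive by date, height and hash."""
    now = now or datetime.now(timezone.utc)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Write under a temporary name first; the final name needs the hash.
    tmp = out_dir / "backup.tmp.tar.gz"
    with tarfile.open(tmp, "w:gz") as tar:
        tar.add(data_dir, arcname=Path(data_dir).name)

    # Hash in chunks so large chain data is never loaded into memory at once.
    sha = hashlib.sha256()
    with open(tmp, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            sha.update(chunk)

    final = out_dir / f"znn-{now:%Y%m%d}-h{height}-{sha.hexdigest()[:12]}.tar.gz"
    tmp.rename(final)
    return final
```

Putting the height in the name lets an operator pick a bootstrap from just below a known problem momentum without unpacking anything.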

Mon, Oct 21, 2024, 13:18:43 - georgezgeorgez: One thing AlienCoder mentioned is bootstrapping as a light client. And I agree. Right now it's pretty much just copying the database directly. In the future, we could have it where you can put in hash "checkpoints" of network state. It still downloads momentums from other nodes, but only does the light client verification against the checkpoint data. And then once it catches up, it starts doing full verification.

But that's not really in scope right now.

Mon, Oct 21, 2024, 13:19:35 - georgezgeorgez: That would be a first foray into sentries too. Keeping only some network state and pruning the rest.

Mon, Oct 21, 2024, 13:19:42 - georgezgeorgez: But I don't like to get too ahead of ourselves.

Mon, Oct 21, 2024, 13:20:16 - georgezgeorgez: My point is, we don't need perfect solutions right now.

Mon, Oct 21, 2024, 13:20:54 - georgezgeorgez: What's the next thing for us to do regarding performance improvements?

To help validate and get vilkris's work having an effect?

Mon, Oct 21, 2024, 13:21:33 - georgezgeorgez: I will create dashboards as mentioned. But what is our action item after that?

Run the experiment, showcase the data, and then merge? Then cut a release?

Mon, Oct 21, 2024, 13:22:10 - deeznnutz: <@georgezgeorgez:hc1.chat "I will create dashboards as ment..."> makes sense to me. Also note that several pillars have been testing the code in production.

Mon, Oct 21, 2024, 13:22:35 - deeznnutz: They mostly deployed it to sync "faster"

Mon, Oct 21, 2024, 13:23:09 - coinselor: <@deeznnutz:zenon.chat "They mostly deployed it to sync ..."> They were actually missing momentums, that's why they tried it

Mon, Oct 21, 2024, 13:23:13 - cryptofish: Proof that it solves the fork issue, without side effects.

Mon, Oct 21, 2024, 13:23:43 - georgezgeorgez: <@deeznnutz:zenon.chat "makes sense to me. Also note th..."> That's good to know, but not really good criteria in terms of approving the MRs.

Mon, Oct 21, 2024, 13:24:14 - coinselor: <@georgezgeorgez:hc1.chat "What's the next thing for us to ..."> Making sure it works for future different improvements to the codebase, maybe networking/libp2p upgrades or something else we can think of monitoring

Mon, Oct 21, 2024, 13:24:17 - deeznnutz: yes that too. good point. They tested it for faster sync and missing momentums in production. Both were "improved" just based on no negative feedback after upgrading

Mon, Oct 21, 2024, 13:24:39 - georgezgeorgez: 1. Solves a specific performance issue

2. Does not make overall performance worse.

The local tests have proven 1

But what I'm getting at with the sync graphs is to prove 2
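
Criterion 2 could be checked from the two sync graphs by comparing how long each node takes to reach the same height. A sketch, assuming `(timestamp_seconds, height)` samples per node; the 5% `tolerance` is an arbitrary placeholder, not an agreed threshold:

```python
def time_to_height(samples, target):
    """Seconds from the first sample until height >= target, or None if never reached."""
    if not samples:
        return None
    start = samples[0][0]
    for ts, height in samples:
        if height >= target:
            return ts - start
    return None


def not_worse(baseline, candidate, target, tolerance=0.05):
    """True when the candidate reaches `target` at most `tolerance` slower than baseline."""
    base = time_to_height(baseline, target)
    cand = time_to_height(candidate, target)
    if base is None or cand is None:
        return False  # no verdict without both measurements
    return cand <= base * (1 + tolerance)
```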

Mon, Oct 21, 2024, 13:24:59 - georgezgeorgez: <@georgezgeorgez:hc1.chat "That's good to know, but not rea..."> We want to put out data, not just anecdotes.

Mon, Oct 21, 2024, 13:26:06 - georgezgeorgez: Note that 2 is overall performance

It's very likely possible to construct specific situations where performance is worse

Mon, Oct 21, 2024, 13:26:39 - georgezgeorgez: Since there is a cost/tradeoff to everything

Mon, Oct 21, 2024, 13:26:52 - deeznnutz: The last item we discussed after our last meeting was implementing Ansible to coordinate pillar / sentry deployment. Seems like the right solution (from my high level research) and now it's just a matter of writing the scripts. I'm happy to research implementing this after getting the backup / restore working

Mon, Oct 21, 2024, 13:27:18 - georgezgeorgez: I can help with that. It's been a while since I've written any ansible, but I used to be fairly proficient.

Mon, Oct 21, 2024, 13:27:41 - georgezgeorgez: We could use ansible to set up the experiment of running two nodes.

Mon, Oct 21, 2024, 13:27:47 - deeznnutz: Cool. I will need to learn it

Mon, Oct 21, 2024, 13:27:50 - georgezgeorgez: with different builds

Mon, Oct 21, 2024, 13:29:39 - vilkris: <@georgezgeorgez:hc1.chat "What's the next thing for us to ..."> The open MRs are of course a priority but I think the overall sync slowness is something that should be focused on. I don't know if slowness of sync was an option to vote on but it seems to be a constant complaint

Mon, Oct 21, 2024, 13:30:27 - deeznnutz: I think slow sync is a real issue. it can take days to sync sometimes

Mon, Oct 21, 2024, 13:30:43 - deeznnutz: And asking someone to restart every few hours does not seem like a good solution

Mon, Oct 21, 2024, 13:30:48 - georgezgeorgez: I think one thing we've identified is that leveldb is single process. Looking at the cpu graphs, we see it maxing out at like 25%

Mon, Oct 21, 2024, 13:31:06 - georgezgeorgez: Changing it to something like rocksdb is a possibility. But probably not a small undertaking

Mon, Oct 21, 2024, 13:32:11 - vilkris: I suspect the leveldb implementation has issues, but it's still weird the performance is so abysmal even though we're effectively syncing an empty chain

Mon, Oct 21, 2024, 13:32:30 - deeznnutz: it's also strange that restarting `go-zenon` seems to help it

Mon, Oct 21, 2024, 13:32:53 - vilkris: I was looking at the code and noticed that some database references are never explicitly released

Mon, Oct 21, 2024, 13:33:04 - vilkris: Only relying on Go's GC to clean them up

Mon, Oct 21, 2024, 13:33:11 - vilkris: That might be a source of problems
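
The general pattern of releasing a handle explicitly instead of waiting for garbage collection, shown in Python for illustration only (go-zenon is written in Go, and `FakeStore` is a made-up stand-in, not the node's database layer):

```python
from contextlib import contextmanager


class FakeStore:
    """Stand-in for a database that hands out handles and counts open ones."""

    def __init__(self):
        self.open_handles = 0

    def acquire(self):
        self.open_handles += 1
        return object()

    def release(self, handle):
        self.open_handles -= 1


@contextmanager
def snapshot(store):
    """Yield a handle and release it when the block exits, even on error."""
    handle = store.acquire()
    try:
        yield handle
    finally:
        store.release(handle)
```

With this shape the count of open handles returns to zero as soon as each block ends, which is the property a test of explicit cleanup would want to observe.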

Mon, Oct 21, 2024, 13:33:41 - georgezgeorgez: The sync graphs and looking at them next to cpu/ram graphs is my answer to trying to understand more issues. Do you suggest another way we can investigate?

Mon, Oct 21, 2024, 13:34:05 - georgezgeorgez: Open to other approaches

Mon, Oct 21, 2024, 13:35:18 - cryptofish: Isn't it possible to attach an analyser to the process and collect all possible data?

Mon, Oct 21, 2024, 13:35:46 - vilkris: pprof would give some more data which could be useful

Mon, Oct 21, 2024, 13:36:07 - georgezgeorgez: yes but which functions do we start with?

Mon, Oct 21, 2024, 13:36:08 - deeznnutz: I think that is how moon found the memory leak

Mon, Oct 21, 2024, 13:36:16 - vilkris: Also creating a way to monitor leveldb metrics

Mon, Oct 21, 2024, 13:36:54 - vilkris: Or just replacing leveldb altogether and seeing if there is a difference

Mon, Oct 21, 2024, 13:36:57 - georgezgeorgez: We probably want to create a set of Go Benchmark tests, but what exactly do we benchmark? Not sure yet

Mon, Oct 21, 2024, 13:37:33 - deeznnutz: LevelDB provides a GetProperty method that allows you to retrieve internal statistics and properties. You can use this method to access various metrics:

Mon, Oct 21, 2024, 13:37:47 - vilkris: If the leveldb library is the main problem then there's little we can do but change it

Mon, Oct 21, 2024, 13:38:24 - vilkris: Its codebase is no longer maintained

Mon, Oct 21, 2024, 13:38:42 - vilkris: So at some point it will probably have to go

Mon, Oct 21, 2024, 13:39:18 - deeznnutz: we upgraded supernova to `rocksdb`

Mon, Oct 21, 2024, 13:39:48 - vilkris: <@deeznnutz:zenon.chat "we upgraded supernova to `rocksd..."> I've been looking at replacing leveldb with pebble

Mon, Oct 21, 2024, 13:39:52 - deeznnutz: we == AC

Mon, Oct 21, 2024, 13:39:56 - coinselor: it has slightly higher hardware requirements, if I remember correctly. Something to consider.

Mon, Oct 21, 2024, 13:40:28 - georgezgeorgez: https://github.com/cockroachdb/pebble

Mon, Oct 21, 2024, 13:40:34 - georgezgeorgez: haha nice

Mon, Oct 21, 2024, 13:41:04 - georgezgeorgez: I want to make sure that we have a decision making framework in place

Mon, Oct 21, 2024, 13:41:11 - georgezgeorgez: for these performance improvements

Mon, Oct 21, 2024, 13:41:21 - georgezgeorgez: e.g. criteria, testing, etc

Mon, Oct 21, 2024, 13:41:24 - georgezgeorgez: data driven

Mon, Oct 21, 2024, 13:41:32 - georgezgeorgez: it can improve over time

Mon, Oct 21, 2024, 13:42:37 - deeznnutz: "Pebble was introduced as an alternative storage engine to RocksDB in CockroachDB v20.1 (released May 2020) and was used in production successfully at that time. Pebble was made the default storage engine in CockroachDB v20.2 (released Nov 2020). Pebble is being used in production by users of CockroachDB at scale and is considered stable and production ready."

Mon, Oct 21, 2024, 13:42:57 - georgezgeorgez: Maybe a good segue.

Changing the db is something we can try first on hyperqube_z/hyperqube_hc1_betanet (name still up in the air)

Mon, Oct 21, 2024, 13:43:37 - georgezgeorgez: The pillars are signalling incentives for operational support for getting it launched.

Mon, Oct 21, 2024, 13:43:47 - georgezgeorgez: And I would task 0x and this SIG with those tasks.

Mon, Oct 21, 2024, 13:44:10 - georgezgeorgez: I would want to come up with a hardware requirement

Mon, Oct 21, 2024, 13:44:22 - georgezgeorgez: For hyperqube_z, I might throttle the network to less than mainnet actually.

Mon, Oct 21, 2024, 13:44:36 - georgezgeorgez: Most of the txs will be voting for now, and it will help us test dynamic plasma more easily once that is ready

Mon, Oct 21, 2024, 13:44:43 - vilkris: That would make sense

Mon, Oct 21, 2024, 13:44:54 - georgezgeorgez: It would also mean lesser node requirement than mainnet

Mon, Oct 21, 2024, 13:45:19 - georgezgeorgez: But I want to give an actual recommendation for the hardware

Mon, Oct 21, 2024, 13:45:20 - vilkris: If mainnet had all full momentums the node requirements would be very high unfortunately

Mon, Oct 21, 2024, 13:45:30 - vilkris: Because of issues with the node implementation

Mon, Oct 21, 2024, 13:45:40 - georgezgeorgez: This would also give another usecase for the deploy script that the SIG has been working on

Mon, Oct 21, 2024, 13:45:52 - georgezgeorgez: <@vilkris:hc1.chat "If mainnet had all full momentum..."> With K6, I want to start testing things like this

Mon, Oct 21, 2024, 13:46:22 - deeznnutz: <@vilkris:hc1.chat "Because of issues with the node ..."> issues beyond what your performance improvements aim to fix and the issues w/ leveldb potentially?

Mon, Oct 21, 2024, 13:46:50 - georgezgeorgez: I know one pillar has asked if they can run the hyperqube_z node alongside their mainnet pillar node.

I probably wouldn't suggest it, but we might want to introduce some options to prevent clashes

Mon, Oct 21, 2024, 13:47:00 - georgezgeorgez: service names, directory locations, port numbers

Mon, Oct 21, 2024, 13:47:19 - georgezgeorgez: It gets into the ansible work as well, running multiple nodes

Mon, Oct 21, 2024, 13:47:34 - georgezgeorgez: only would want to have 1 monitoring stack

Mon, Oct 21, 2024, 13:47:40 - georgezgeorgez: that both nodes are sending data to
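
A small check for the clashes listed above (service names, directory locations, port numbers) when two nodes share a host. All keys and values here are hypothetical examples, not go-zenon defaults:

```python
def find_clashes(nodes, keys=("service_name", "data_dir", "rpc_port")):
    """Return {(key, value): [node names]} for settings used by more than one node."""
    seen = {}
    for name, cfg in nodes.items():
        for key in keys:
            seen.setdefault((key, cfg[key]), []).append(name)
    return {setting: names for setting, names in seen.items() if len(names) > 1}


# Example: two nodes that differ in service name and directory but share a port.
nodes = {
    "mainnet": {"service_name": "node-a", "data_dir": "/srv/node-a", "rpc_port": 9000},
    "hyperqube_z": {"service_name": "node-b", "data_dir": "/srv/node-b", "rpc_port": 9000},
}
clashes = find_clashes(nodes)
```

A deployment script or Ansible play could run such a check before starting the second node and refuse to continue when the result is non-empty.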

Mon, Oct 21, 2024, 13:48:21 - vilkris: <@deeznnutz:zenon.chat "issues beyond what your performa..."> If the db issues are fixed there are still some problems but they can be alleviated by utilizing the cache from V2

Mon, Oct 21, 2024, 13:48:43 - deeznnutz: down to about 10 minutes.

Mon, Oct 21, 2024, 13:48:44 - deeznnutz: The only other item we discussed was setting up a testnet automatically. This is one area where I think someone else might be better suited to lead the charge. I can automate the steps in CF's repo (to setup a testnet) but I think we discussed more efficient (better) ways of doing it. I think George had some good ideas and libraries / apps to help

Mon, Oct 21, 2024, 13:49:24 - georgezgeorgez: i apologize in advance for the splits if they are annoying

but I will probably setup a Test Tooling SIG in the near future

Mon, Oct 21, 2024, 13:49:41 - georgezgeorgez: local devnets should be accessible to all devs now

Mon, Oct 21, 2024, 13:50:04 - georgezgeorgez: i think fish's instructions use nomctl's devnet create action

Mon, Oct 21, 2024, 13:50:19 - cryptofish: <@georgezgeorgez:hc1.chat "i think fish's instructions use ..."> Correct

Mon, Oct 21, 2024, 13:50:45 - georgezgeorgez: I don't want to get too much into it for this SIG, but the dynamic testnet spin up work will go nicely into extension chain setup as well.

Mon, Oct 21, 2024, 13:50:57 - georgezgeorgez: We could possibly implement it as a smart contract on hyperqube_z

Mon, Oct 21, 2024, 13:51:38 - georgezgeorgez: And that would serve as a start for the extension chain embedded

Mon, Oct 21, 2024, 13:51:54 - georgezgeorgez: it's the same concepts, agreeing on initial validators, genesis, time to start, etc

Mon, Oct 21, 2024, 13:52:06 - georgezgeorgez: but not relevant to us here right now

Mon, Oct 21, 2024, 13:52:26 - deeznnutz: Then those were the follow up items from last meeting

Mon, Oct 21, 2024, 13:52:46 - deeznnutz: I think the 4w cadence works well.

Mon, Oct 21, 2024, 13:52:55 - deeznnutz: and this time is fine for me too

Mon, Oct 21, 2024, 13:53:56 - coinselor: maybe we add `--debug` to the wishlist? maybe it builds a diff go-zenon branch that includes pprof/etc tooling?

Mon, Oct 21, 2024, 13:54:17 - georgezgeorgez: What's our action items?

I will get the sync Graphs created and integrated into our deployment process.

Once we have the modified code for hyperqube_z, we will need to write instructions and give a hardware recommendation.

Is there something we can do regarding leveldb right now?

Mon, Oct 21, 2024, 13:54:41 - deeznnutz: I will work on finishing backup / restore locally

Mon, Oct 21, 2024, 13:55:31 - georgezgeorgez: deeznnutz: we can do an ansible crash course in this chat

Mon, Oct 21, 2024, 13:55:41 - georgezgeorgez: I'll walk you through some things

Mon, Oct 21, 2024, 13:56:30 - vilkris: <@georgezgeorgez:hc1.chat "What's our action items?"> With regard to leveldb - I'm interested in testing whether explicitly cleaning up the resources has an effect on the sync. And if the database is replaced there has to be a mechanism to clean up database references since other databases explicitly require it

Mon, Oct 21, 2024, 13:56:56 - vilkris: This means that quite a few places in the code have to be touched

Mon, Oct 21, 2024, 13:57:36 - georgezgeorgez: Makes sense. I'm hoping once the incubator process is more mature and on-chain on hyperqube_z/hc1_betanet, that we can start getting forward incentives for this kind of work too.

Mon, Oct 21, 2024, 13:57:53 - georgezgeorgez: I've asked 0x to start filling out an OP SIG WP1

Mon, Oct 21, 2024, 13:58:18 - deeznnutz: Anything else before we close this meeting out?

Mon, Oct 21, 2024, 13:58:27 - georgezgeorgez: Next meeting time

Mon, Oct 21, 2024, 13:58:48 - georgezgeorgez: This should usually work for me

it's possible i have a conflict but that's fine

Mon, Oct 21, 2024, 13:58:51 - deeznnutz: 18 Nov at the same time?

Mon, Oct 21, 2024, 13:59:05 - vilkris: Should work yes

Mon, Oct 21, 2024, 13:59:20 - deeznnutz: Cool - thanks everyone!

Mon, Oct 21, 2024, 13:59:37 - deeznnutz: === END OP SIG 21 Oct 2024 ===

@@ Line 39: / Line 39: @@
 == Meeting Minutes ==
+Mon, Oct 21, 2024, 13:00:48 - deeznnutz: === START OP SIG 21 OCT 2024 ===
+Mon, Oct 21, 2024, 13:00:52 - deeznnutz: Hello!
+Mon, Oct 21, 2024, 13:01:00 - georgezgeorgez: Hello
+Mon, Oct 21, 2024, 13:01:00 - vilkris: Hello
+Mon, Oct 21, 2024, 13:01:27 - georgezgeorgez: deeznnutz: did you have any notes/agenda to post first or do we just get into it?
+Mon, Oct 21, 2024, 13:01:36 - deeznnutz: yes - just a few
+Mon, Oct 21, 2024, 13:01:46 - deeznnutz: Since last meeting we've made progress in the areas we discussed.
+In the community poll on priorities we got feedback that node deployment options were important along with real time monitoring. https://forum.hypercore.one/t/community-poll-priority-enhacements-for-deployment-script
+After that feedback George wrote some code to enable grafana monitoring of a few node endpoints
+* syncCurrentHeight\
+* syncTargetHeight\
+* networkConnectedPeers
+If anyone is interested we did a chatGPT summary of the code here:  https://forum.hypercore.one/t/metrics-code-review-for-op-sig-meeting-by-chatgpt/531.  I think next steps are to test this and then setup grafana to ingest that endpoint.
+Mon, Oct 21, 2024, 13:02:53 - georgezgeorgez: Yup, the next step for me there is to create the dashboards that will allow us to track and visualize a node syncing over time.
+Mon, Oct 21, 2024, 13:03:07 - georgezgeorgez: That will also allow us to visualize the impact of vilkris 's performance work.
+Mon, Oct 21, 2024, 13:03:35 - georgezgeorgez: We should start thinking about what that test/experiment should look like.
+Mon, Oct 21, 2024, 13:03:43 - tapwoot joined the room
+Mon, Oct 21, 2024, 13:03:56 - georgezgeorgez: Spin up two nodes at the same time, one with, one without performance improvements, and see how they compare in terms of sync graph?
+Mon, Oct 21, 2024, 13:04:29 - georgezgeorgez: I should probably make a thread on the forum for us to start working on that.
+Mon, Oct 21, 2024, 13:05:27 - deeznnutz: vilkris: do you expect the performance improvements to improve sync times or just processing new momentums?
+Mon, Oct 21, 2024, 13:05:31 - coinselor: Would be nice to get this data in a structured format with as little user interaction as possible.
+Mon, Oct 21, 2024, 13:05:48 - georgezgeorgez: <@coinselor:zenon.chat "Would be nice to get this data i..."> Do you mean as an automated test?
+Mon, Oct 21, 2024, 13:06:24 - vilkris: We should keep in mind that the goal of the performance improvements weren't to decrease sync time, but to tackle extremely poor performance in specific situations
+Mon, Oct 21, 2024, 13:06:37 - vilkris: Decreased sync time might be a positive side effect
+Mon, Oct 21, 2024, 13:06:57 - georgezgeorgez: Of course, we have the node exporter metrics to look at as well.
+Mon, Oct 21, 2024, 13:07:19 - coinselor: I'm thinking when we have user's test the script, and sync v1/v2, we at least get some output logs or something we can get from them to analyze this data later
+Mon, Oct 21, 2024, 13:07:41 - georgezgeorgez: With regards to sync times, we know things get stuck at some problematic momentums.
+That would look flat on a sync graph.
+Mon, Oct 21, 2024, 13:08:09 - georgezgeorgez: And when we put those graphs next to CPU/RAM graphs, and look at where things goes flat.
+We can start to diagnose more problem momentums and their fixes.
+Mon, Oct 21, 2024, 13:08:18 - cryptofish: <@vilkris:hc1.chat "We should keep in mind that the ..."> Some if those specific situations can be easily created locally.
+Mon, Oct 21, 2024, 13:08:20 - georgezgeorgez: But yes, good point vilkris
+Mon, Oct 21, 2024, 13:08:26 - deeznnutz: makes sense... testing do the improvements help sync past the trouble momentums quickly.
+Mon, Oct 21, 2024, 13:09:10 - georgezgeorgez: As we mature, we'll have better tools for testing things in isolation.
+Mon, Oct 21, 2024, 13:09:27 - georgezgeorgez: The sync graphs combined with the node cpu/ram data is an initial crude way to measure.
+Mon, Oct 21, 2024, 13:10:13 - georgezgeorgez: And something easily explainable to the entire community.
+Mon, Oct 21, 2024, 13:11:41 - vilkris: Yeah it's a starting point. I've been running a side by side test of syncing a v2 node and a baseline node. V2 is ahead but both are still very slow to sync
+Mon, Oct 21, 2024, 13:11:50 - vilkris: This is on VPS
+Mon, Oct 21, 2024, 13:12:36 - vilkris: I think this problem is something we might want to give higher priority
+Mon, Oct 21, 2024, 13:13:15 - georgezgeorgez: <@cryptofish:hc1.chat "Some if those specific situation..."> When vilkris and i did the code review, we briefly discussed how we could test this locally. At least for the problem of retrieving old state and having to rollback many momentums, constructing that data in a local test case might be difficult. The current local testing is using snapshots right? And we haven't documented how to create that in an easy way.
+Mon, Oct 21, 2024, 13:14:18 - georgezgeorgez: I'm looking at performance testing tools. <nowiki>https://k6.io/</nowiki> is fairly popular and something I'm thinking about how we can make useful.
+Mon, Oct 21, 2024, 13:14:21 - vilkris: <@georgezgeorgez:hc1.chat "When vilkris and i did the code ..."> Yes using a snapshot is the easiest way to test a problematic momentum
+Mon, Oct 21, 2024, 13:14:48 - deeznnutz: In our matrix chat last week or so we discussed some "backup" / "bootstrapping" options.  I think we've learned from the Supernova deployment that having a local bootstrap of the network can be useful. I recommend that NoM / HyperQube operators have their own bootstrap so they don't need to rely on anyone else.  I've been working on a backup / restore script that saves the chain data locally so users can backup their chain data and recover it locally.  I'm also working on an option to save the data an S3 endpoint.
+Mon, Oct 21, 2024, 13:15:03 - cryptofish: <@georgezgeorgez:hc1.chat "When vilkris and i did the code ..."> Cannot remember exactly, but yeah we used an old snapshot of which we knew the height that caused the issue.
+Mon, Oct 21, 2024, 13:15:28 - deeznnutz: In supernova operators have been sharing snapshots and for a small network I think it's a really bad idea
+Mon, Oct 21, 2024, 13:16:00 - georgezgeorgez: In terms of trustlessness, yeah that's an attack surface.
+Mon, Oct 21, 2024, 13:16:12 - vilkris: <@deeznnutz:zenon.chat "In our matrix chat last week or ..."> Are you copying the DB while the node is running?
+Mon, Oct 21, 2024, 13:16:18 - georgezgeorgez: That's why I have been only suggesting personal bootstraps
+Mon, Oct 21, 2024, 13:16:22 - georgezgeorgez: and not public ones
+Mon, Oct 21, 2024, 13:16:36 - deeznnutz: The script simply stops `go-zenon` copies the necessary files, starts `go-zenon` and then tars the files with a date and hash.
+Mon, Oct 21, 2024, 13:16:46 - vilkris: Ok gotcha
+Mon, Oct 21, 2024, 13:17:26 - coinselor: maybe would be nice to add height to the name, too?
+Mon, Oct 21, 2024, 13:17:45 - deeznnutz: ya, good call
+Mon, Oct 21, 2024, 13:18:43 - georgezgeorgez: One thing AlienCoder mentioned is bootstrapping as light client. And I agree. Right now it's pretty much just copying the database directly. In the future, we could have it where you can put in hash "checkpoints" of network state. It still downloads momentums from other nodes, but only does the light client verification against the checkpoint data. And then once it catches it, it starts doing full verification.
+But that's not really in scope right now.
+Mon, Oct 21, 2024, 13:19:35 - georgezgeorgez: That would be a first foray into sentries too. Keeping only some network state and pruning the rest.
+Mon, Oct 21, 2024, 13:19:42 - georgezgeorgez: But I don't like to get too ahead of ourselves.
+Mon, Oct 21, 2024, 13:20:16 - georgezgeorgez: My point is, we don't need perfect solutions right now.
+Mon, Oct 21, 2024, 13:20:54 - georgezgeorgez: What's the next thing for us to do regarding performance improvements?
+To help validate and get vilkris's work having an effect?
+Mon, Oct 21, 2024, 13:21:33 - georgezgeorgez: I will create dashboards as mentioned. But what is our action item after that?
+Run the experiment, showcase the data, and then merge? Then cut a release?
+Mon, Oct 21, 2024, 13:22:10 - deeznnutz: <@georgezgeorgez:hc1.chat "I will create dashboards as ment..."> makes sense to me.  Also note that several pillars have been testing the code in production.
+Mon, Oct 21, 2024, 13:22:35 - deeznnutz: They mostly deployed it to sync "faster"
+Mon, Oct 21, 2024, 13:23:09 - coinselor: <@deeznnutz:zenon.chat "They mostly deployed it to sync ..."> They were actually missing momentums, that's why they tried it
+Mon, Oct 21, 2024, 13:23:13 - cryptofish: Proof that it solves the fork issue, without side effects.
+Mon, Oct 21, 2024, 13:23:43 - georgezgeorgez: <@deeznnutz:zenon.chat "makes sense to me.  Also note th..."> That's good to know, but not really good criteria in terms of approving the MRs.
+Mon, Oct 21, 2024, 13:24:14 - coinselor: <@georgezgeorgez:hc1.chat "What's the next thing for us to ..."> Making sure it works for future different improvements to the codebase, maybe networking/libp2p upgrades or something else we can think of monitoring
+Mon, Oct 21, 2024, 13:24:17 - deeznnutz: yes that too.  good point.  They tested it for sync faster and missing momentum in production.  Both were "improved" just based on no negative feedback after upgrading
+Mon, Oct 21, 2024, 13:24:39 - georgezgeorgez: 1. Solves a specific performance issue
+2. Does not make overall performance worse.
+The local tests have proven 1
+But what I'm getting at with the sync graphs is to prove 2
+Mon, Oct 21, 2024, 13:24:59 - georgezgeorgez: <@georgezgeorgez:hc1.chat "That's good to know, but not rea..."> We want to put out data, not just anecdotes.
+Mon, Oct 21, 2024, 13:26:06 - georgezgeorgez: Note that 2 is overall performance
+It's very likely possible to construct specific situations where performance is worse
+Mon, Oct 21, 2024, 13:26:39 - georgezgeorgez: Since there is a cost/tradeoff to everything
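Note: criterion 2 can be turned into a number once two nodes with different builds report their height over time. The sketch below is one possible way to do that, assuming height samples scraped from something like the exporter's currentHeight metric. The `sample` type, the `notWorse` helper, the 5% tolerance, and the figures in `main` are all made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// sample is one observation of a node's height at some elapsed time.
type sample struct {
	ElapsedSec float64 // seconds since the experiment started
	Height     uint64  // height reported by the node at that moment
}

// syncRate returns the average momentums synced per second between the
// first and last sample.
func syncRate(samples []sample) (float64, error) {
	if len(samples) < 2 {
		return 0, errors.New("need at least two samples")
	}
	first, last := samples[0], samples[len(samples)-1]
	dt := last.ElapsedSec - first.ElapsedSec
	if dt <= 0 || last.Height < first.Height {
		return 0, errors.New("samples must advance in time and not decrease in height")
	}
	return float64(last.Height-first.Height) / dt, nil
}

// notWorse reports whether the candidate rate is at least the baseline rate
// minus a relative tolerance (0.05 means up to 5% slower is still accepted).
func notWorse(baseline, candidate, tolerance float64) bool {
	return candidate >= baseline*(1-tolerance)
}

func main() {
	// Illustrative figures only, not measurements.
	baseline := []sample{{0, 1000}, {600, 7000}}
	candidate := []sample{{0, 1000}, {600, 7600}}

	b, err := syncRate(baseline)
	if err != nil {
		fmt.Println("baseline:", err)
		return
	}
	c, err := syncRate(candidate)
	if err != nil {
		fmt.Println("candidate:", err)
		return
	}
	fmt.Printf("baseline %.2f/s, candidate %.2f/s, not worse: %v\n", b, c, notWorse(b, c, 0.05))
}
```

A single average hides stalls, so the graphs are still needed. The number only gives the merge decision a reproducible pass/fail line.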
+Mon, Oct 21, 2024, 13:26:52 - deeznnutz: The last item we discussed after our last meeting was implementing Ansible to coordinate pillar / sentry deployment.  Seems like the right solution (from my high level research) and now it's just a matter of writing the scripts.  I'm happy to research implementing this after getting the backup / restore working
+Mon, Oct 21, 2024, 13:27:18 - georgezgeorgez: I can help with that. It's been a while since I've written any ansible, but I used to be fairly proficient.
+Mon, Oct 21, 2024, 13:27:41 - georgezgeorgez: We could use the ansible to setup the experiment of running two nodes.
+Mon, Oct 21, 2024, 13:27:47 - deeznnutz: Cool.  I will need to learn it
+Mon, Oct 21, 2024, 13:27:50 - georgezgeorgez: with different builds
+Mon, Oct 21, 2024, 13:29:39 - vilkris: <@georgezgeorgez:hc1.chat "What's the next thing for us to ..."> The open MRs are of course a priority but I think the overall sync slowness is something that should be focused on. I don't know if slowness of sync was an option to vote on but it seems to be a constant complaint
+Mon, Oct 21, 2024, 13:30:27 - deeznnutz: I think slow sync is a real issue.  it can take days to sync sometimes
+Mon, Oct 21, 2024, 13:30:43 - deeznnutz: And asking someone to restart every few hours does not seem like a good solution
+Mon, Oct 21, 2024, 13:30:48 - georgezgeorgez: I think one thing we've identified is that leveldb is single process. Looking at the cpu graphs, we see it maxing out at like 25%
+Mon, Oct 21, 2024, 13:31:06 - georgezgeorgez: Changing it to something like rocksdb is a possibility. But probably not a small undertaking
+Mon, Oct 21, 2024, 13:32:11 - vilkris: I suspect the leveldb implementation has issues, but it's still weird the performance is so abysmal even though we're effectively syncing an empty chain
+Mon, Oct 21, 2024, 13:32:30 - deeznnutz: it's also strange that restarting `go-zenon` seems to help it
+Mon, Oct 21, 2024, 13:32:53 - vilkris: I was looking at the code and noticed that some database references are never explicitly released
+Mon, Oct 21, 2024, 13:33:04 - vilkris: Only relying on Go's GC to clean them up
+Mon, Oct 21, 2024, 13:33:11 - vilkris: That might be a source of problems
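Note: the difference between relying on the GC and releasing explicitly can be shown with a small stand-in. The `store` and `iterator` types below are hypothetical and are not the leveldb or go-zenon API. They only count unreleased iterators so that a leak becomes visible, and `countKeys` shows the usual `defer` pattern.

```go
package main

import "fmt"

// store is a hypothetical in-memory stand-in for a key-value database.
// It counts iterators that have not been released yet.
type store struct {
	data map[string]string
	open int
}

// iterator walks a snapshot of the keys that existed when it was created.
type iterator struct {
	s        *store
	keys     []string
	pos      int
	released bool
}

// NewIterator hands out an iterator and records it as open.
func (s *store) NewIterator() *iterator {
	keys := make([]string, 0, len(s.data))
	for k := range s.data {
		keys = append(keys, k)
	}
	s.open++
	return &iterator{s: s, keys: keys, pos: -1}
}

// Next advances the iterator and reports whether a key is available.
func (it *iterator) Next() bool {
	it.pos++
	return it.pos < len(it.keys)
}

// Key returns the key at the current position.
func (it *iterator) Key() string { return it.keys[it.pos] }

// Release returns the iterator's resources to the store. It is idempotent.
func (it *iterator) Release() {
	if it.released {
		return
	}
	it.released = true
	it.s.open--
}

// countKeys releases its iterator explicitly via defer, so nothing is left
// for the garbage collector to clean up at some later, unknown time.
func countKeys(s *store) int {
	it := s.NewIterator()
	defer it.Release()
	n := 0
	for it.Next() {
		n++
	}
	return n
}

func main() {
	s := &store{data: map[string]string{"a": "1", "b": "2"}}
	fmt.Println("keys:", countKeys(s), "open iterators:", s.open)

	leaked := s.NewIterator() // never released: the store still counts it
	_ = leaked
	fmt.Println("open iterators after a leak:", s.open)
}
```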
+Mon, Oct 21, 2024, 13:33:41 - georgezgeorgez: The sync graphs and looking at them next to cpu/ram graphs is my answer to trying to understand more issues. Do you suggest another way we can investigate?
+Mon, Oct 21, 2024, 13:34:05 - georgezgeorgez: Open to other approaches
+Mon, Oct 21, 2024, 13:35:18 - cryptofish: Isn't it possible to attach an analyser to the process and collect all possible data?
+Mon, Oct 21, 2024, 13:35:46 - vilkris: pprof would give some more data which could be useful
+Mon, Oct 21, 2024, 13:36:07 - georgezgeorgez: yes but which functions do we start with?
+Mon, Oct 21, 2024, 13:36:08 - deeznnutz: I think that is how moon found the memory leak
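Note: for reference, capturing a CPU profile with the standard library's `runtime/pprof` takes only a few lines. The sketch below is a minimal illustration, not how go-zenon is instrumented. `profileCPU` and `busyWork` are hypothetical helpers, and the workload is a placeholder for whatever code path ends up being investigated.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// profileCPU records a CPU profile while work runs and returns the raw
// profile bytes (the format that `go tool pprof` reads).
func profileCPU(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	work()
	pprof.StopCPUProfile() // flushes the profile into buf
	return buf.Bytes(), nil
}

// busyWork is a placeholder workload standing in for the code under study.
func busyWork() uint64 {
	var sum uint64
	for i := uint64(0); i < 50_000_000; i++ {
		sum += i * i
	}
	return sum
}

func main() {
	var result uint64
	data, err := profileCPU(func() { result = busyWork() })
	if err != nil {
		fmt.Println("profile failed:", err)
		return
	}
	fmt.Println("workload result:", result)
	fmt.Println("profile size in bytes:", len(data))
}
```

Writing the returned bytes to a file makes them readable with `go tool pprof`, which lists the hottest functions and so gives a first answer to "which functions do we start with?".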
+Mon, Oct 21, 2024, 13:36:16 - vilkris: Also creating a way to monitor leveldb metrics
+Mon, Oct 21, 2024, 13:36:54 - vilkris: Or just replacing leveldb altogether and seeing if there is a difference
+Mon, Oct 21, 2024, 13:36:57 - georgezgeorgez: We probably want to create a set of Go Benchmark tests, but what exactly do we benchmark? Not sure yet
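Note: what to benchmark is still open, but the harness itself is small. The sketch below uses the standard `testing` package and a hypothetical map-backed `kv` store as a placeholder workload. In a real repository this would live in a `_test.go` file as `func BenchmarkPut(b *testing.B)` and be run with `go test -bench`. Here `testing.Benchmark` is called directly so the example runs as a standalone program.

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

// kv is a hypothetical map-backed store used only as a placeholder workload.
type kv struct {
	m map[string][]byte
}

// Put stores a value under a key.
func (s *kv) Put(key string, value []byte) {
	s.m[key] = value
}

// benchmarkPut measures Put over a small rotating key set.
func benchmarkPut(b *testing.B) {
	s := &kv{m: make(map[string][]byte)}
	value := make([]byte, 64)
	b.ReportAllocs()
	b.ResetTimer() // exclude the setup above from the measurement
	for i := 0; i < b.N; i++ {
		s.Put(strconv.Itoa(i%1024), value)
	}
}

// runPutBenchmark runs the benchmark outside of `go test`.
func runPutBenchmark() testing.BenchmarkResult {
	return testing.Benchmark(benchmarkPut)
}

func main() {
	res := runPutBenchmark()
	fmt.Println(res.String(), res.MemString())
}
```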
+Mon, Oct 21, 2024, 13:37:33 - deeznnutz: LevelDB provides a GetProperty method that allows you to retrieve internal statistics and properties. You can use this method to access various metrics:
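Note: a minimal sketch of reading such properties follows. It assumes the node uses the goleveldb library, whose `*leveldb.DB` exposes `GetProperty(name string) (string, error)`. To stay runnable without that dependency, the code talks to a small `propertyGetter` interface and a `fakeDB` test double, both hypothetical. The property names in `main` are ones goleveldb documents, but they are worth checking against the version in use.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// propertyGetter is the assumed shape of the database handle.
type propertyGetter interface {
	GetProperty(name string) (string, error)
}

// collectProperties reads each named property. Values that could be read are
// returned in the map; failures are returned as errors naming the property.
func collectProperties(db propertyGetter, names []string) (map[string]string, []error) {
	values := make(map[string]string, len(names))
	var errs []error
	for _, name := range names {
		v, err := db.GetProperty(name)
		if err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", name, err))
			continue
		}
		values[name] = v
	}
	return values, errs
}

// fakeDB is a test double standing in for a real database handle.
type fakeDB map[string]string

// GetProperty returns the canned value for a known property name.
func (f fakeDB) GetProperty(name string) (string, error) {
	v, ok := f[name]
	if !ok {
		return "", errors.New("unknown property")
	}
	return v, nil
}

func main() {
	db := fakeDB{
		"leveldb.stats":      "(compaction table would appear here)",
		"leveldb.writedelay": "(write delay summary would appear here)",
	}
	names := []string{"leveldb.stats", "leveldb.iostats", "leveldb.writedelay"}

	values, errs := collectProperties(db, names)
	keys := make([]string, 0, len(values))
	for k := range values {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s = %s\n", k, values[k])
	}
	for _, err := range errs {
		fmt.Println("skipped:", err)
	}
}
```

Values collected this way are plain strings, so exporting them to the monitoring stack would still need parsing into numeric metrics.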
+Mon, Oct 21, 2024, 13:37:47 - vilkris: If the leveldb library is the main problem then there's little we can do but change it
+Mon, Oct 21, 2024, 13:38:24 - vilkris: Its codebase is no longer maintained
+Mon, Oct 21, 2024, 13:38:42 - vilkris: So at some point it will probably have to go
+Mon, Oct 21, 2024, 13:39:18 - deeznnutz: we upgraded supernova to `rocksdb`
+Mon, Oct 21, 2024, 13:39:48 - vilkris: <@deeznnutz:zenon.chat "we upgraded supernova to `rocksd..."> I've been looking at replacing leveldb with pebble
+Mon, Oct 21, 2024, 13:39:52 - deeznnutz: we == AC
+Mon, Oct 21, 2024, 13:39:56 - coinselor: it has slightly higher hardware requirements, if I remember correctly. Something to consider.
+Mon, Oct 21, 2024, 13:40:28 - georgezgeorgez: https://github.com/cockroachdb/pebble
+Mon, Oct 21, 2024, 13:40:29 - deeznnutz: Message deleted
+Mon, Oct 21, 2024, 13:40:34 - georgezgeorgez: haha nice
+Mon, Oct 21, 2024, 13:41:04 - georgezgeorgez: I want to make sure that we have a decision making framework in place
+Mon, Oct 21, 2024, 13:41:11 - georgezgeorgez: for these performance improvements
+Mon, Oct 21, 2024, 13:41:21 - georgezgeorgez: e.g. criteria, testing, etc
+Mon, Oct 21, 2024, 13:41:24 - georgezgeorgez: data driven
+Mon, Oct 21, 2024, 13:41:32 - georgezgeorgez: it can improve over time
+Mon, Oct 21, 2024, 13:42:37 - deeznnutz: "Pebble was introduced as an alternative storage engine to RocksDB in CockroachDB v20.1 (released May 2020) and was used in production successfully at that time. Pebble was made the default storage engine in CockroachDB v20.2 (released Nov 2020). Pebble is being used in production by users of CockroachDB at scale and is considered stable and production ready."
+Mon, Oct 21, 2024, 13:42:57 - georgezgeorgez: Maybe a good segue.
+Changing the db is something we can try first on hyperqube_z/hyperqube_hc1_betanet (name still up in the air)
+Mon, Oct 21, 2024, 13:43:37 - georgezgeorgez: The pillars are signalling incentives for operational support for getting it launched.
+Mon, Oct 21, 2024, 13:43:47 - georgezgeorgez: And I would task 0x and this SIG with those tasks.
+Mon, Oct 21, 2024, 13:44:10 - georgezgeorgez: I would want to come up with a hardware requirement
+Mon, Oct 21, 2024, 13:44:22 - georgezgeorgez: For hyperqube_z, I might throttle the network to less than mainnet actually.
+Mon, Oct 21, 2024, 13:44:36 - georgezgeorgez: Most of the txs will be voting for now, and it will help us test dynamic plasma more easily once that is ready
+Mon, Oct 21, 2024, 13:44:43 - vilkris: That would make sense
+Mon, Oct 21, 2024, 13:44:44 - georgezgeorgez: Message deleted
+Mon, Oct 21, 2024, 13:44:54 - georgezgeorgez: It would also mean lesser node requirement than mainnet
+Mon, Oct 21, 2024, 13:45:19 - georgezgeorgez: But I want to give an actual recommendation for the hardware
+Mon, Oct 21, 2024, 13:45:20 - vilkris: If mainnet had all full momentums the node requirements would be very high unfortunately
+Mon, Oct 21, 2024, 13:45:30 - vilkris: Because of issues with the node implementation
+Mon, Oct 21, 2024, 13:45:40 - georgezgeorgez: This would also give another usecase for the deploy script that the SIG has been working on
+Mon, Oct 21, 2024, 13:45:52 - georgezgeorgez: <@vilkris:hc1.chat "If mainnet had all full momentum..."> With K6, I want to start testing things like this
+Mon, Oct 21, 2024, 13:46:22 - deeznnutz: <@vilkris:hc1.chat "Because of issues with the node ..."> issues beyond what your performance improvements aim to fix and the issues w/ leveldb potentially?
+Mon, Oct 21, 2024, 13:46:50 - georgezgeorgez: I know one pillar has asked if they can run the hyperqube_z node alongside their mainnet pillar node.
+I probably wouldn't suggest it, but we might want to introduce some options to prevent clashes
+Mon, Oct 21, 2024, 13:47:00 - georgezgeorgez: service names, directory locations, port numbers
+Mon, Oct 21, 2024, 13:47:19 - georgezgeorgez: It gets into the ansible work as well, running multiple nodes
+Mon, Oct 21, 2024, 13:47:34 - georgezgeorgez: only would want to have 1 monitoring stack
+Mon, Oct 21, 2024, 13:47:40 - georgezgeorgez: that both nodes are sending data to
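Note: the clash prevention could start as a simple pre-flight check in the deploy tooling. The sketch below is an assumption about how that might look. The `nodeInstance` type, the service names, the directories, and the port numbers are illustrative placeholders, not the project's actual defaults.

```go
package main

import "fmt"

// nodeInstance describes what one node occupies on a host.
type nodeInstance struct {
	Service string // service (unit) name
	DataDir string // data directory
	Ports   []int  // listening ports
}

// clashes lists every resource that two instances on the same host would
// both try to use.
func clashes(a, b nodeInstance) []string {
	var out []string
	if a.Service == b.Service {
		out = append(out, "service name "+a.Service)
	}
	if a.DataDir == b.DataDir {
		out = append(out, "data directory "+a.DataDir)
	}
	used := make(map[int]bool, len(a.Ports))
	for _, p := range a.Ports {
		used[p] = true
	}
	for _, p := range b.Ports {
		if used[p] {
			out = append(out, fmt.Sprintf("port %d", p))
		}
	}
	return out
}

func main() {
	// Placeholder values for illustration only.
	mainnet := nodeInstance{Service: "go-zenon", DataDir: "/var/lib/mainnet-node", Ports: []int{35995, 35997}}
	second := nodeInstance{Service: "hyperqube-z", DataDir: "/var/lib/hyperqube-node", Ports: []int{35997, 45995}}

	found := clashes(mainnet, second)
	if len(found) == 0 {
		fmt.Println("no clashes")
		return
	}
	for _, c := range found {
		fmt.Println("clash:", c)
	}
}
```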
+Mon, Oct 21, 2024, 13:48:21 - vilkris: <@deeznnutz:zenon.chat "issues beyond what your performa..."> If the db issues are fixed there are still some problems but they can be alleviated by utilizing the cache from V2
+Mon, Oct 21, 2024, 13:48:43 - deeznnutz: down to about 10 minutes.
+Mon, Oct 21, 2024, 13:48:44 - deeznnutz: The only other item we discussed was setting up a testnet automatically.  This is one area where I think someone else might be better suited to lead the charge.  I can automate the steps in CF's repo (to setup a testnet) but I think we discussed more efficient (better) ways of doing it.  I think George had some good ideas and libraries / apps to help
+Mon, Oct 21, 2024, 13:49:24 - georgezgeorgez: i apologize in advance for the splits if they are annoying
+but I will probably setup a Test Tooling SIG in the near future
+Mon, Oct 21, 2024, 13:49:41 - georgezgeorgez: local devnets should be accessible to all devs now
+Mon, Oct 21, 2024, 13:50:04 - georgezgeorgez: i think fish's instructions use nomctl's devnet create action
+Mon, Oct 21, 2024, 13:50:19 - cryptofish: <@georgezgeorgez:hc1.chat "i think fish's instructions use ..."> Correct
+Mon, Oct 21, 2024, 13:50:45 - georgezgeorgez: I don't want to get too much into it for this SIG, but the dynamic testnet spin up work will go nicely into extension chain setup as well.
+Mon, Oct 21, 2024, 13:50:57 - georgezgeorgez: We could possibly implement it as a smart contract on hyperqube_z
+Mon, Oct 21, 2024, 13:51:38 - georgezgeorgez: And that would serve as a start for the extension chain embedded
+Mon, Oct 21, 2024, 13:51:54 - georgezgeorgez: it's the same concepts, agreeing on initial validators, genesis, time to start, etc
+Mon, Oct 21, 2024, 13:52:06 - georgezgeorgez: but not relevant to us here right now
+Mon, Oct 21, 2024, 13:52:26 - deeznnutz: Then those were the follow up items from last meeting
+Mon, Oct 21, 2024, 13:52:46 - deeznnutz: I think the 4w cadence works well.
+Mon, Oct 21, 2024, 13:52:55 - deeznnutz: and this time is fine for me too
+Mon, Oct 21, 2024, 13:53:56 - coinselor: maybe we add `--debug` to the wishlist? maybe it builds a diff go-zenon branch that includes pprof/etc tooling?
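Note: as an alternative to building a separate branch, a runtime flag could gate the profiling endpoints. The sketch below shows that variant using the standard `net/http/pprof` handlers on a dedicated mux. The `--debug` flag wiring, `newDebugMux`, `statusFor`, and the `127.0.0.1:6060` address are assumptions for illustration and do not exist in go-zenon.

```go
package main

import (
	"flag"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux returns a mux that serves the pprof handlers only when enabled.
// With enabled == false the mux is empty and every path returns 404.
func newDebugMux(enabled bool) *http.ServeMux {
	mux := http.NewServeMux()
	if enabled {
		mux.HandleFunc("/debug/pprof/", pprof.Index)
		mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
		mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
		mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
		mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	}
	return mux
}

// statusFor issues an in-memory GET against a handler and returns the status
// code. It is a small helper for checking the mux without opening a socket.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	debug := flag.Bool("debug", false, "expose pprof handlers on localhost")
	flag.Parse()

	if !*debug {
		fmt.Println("debug disabled; pprof handlers not exposed")
		return
	}
	// Bind to localhost only so profiles are not reachable from the network.
	fmt.Println("pprof listening on http://127.0.0.1:6060/debug/pprof/")
	if err := http.ListenAndServe("127.0.0.1:6060", newDebugMux(true)); err != nil {
		fmt.Println("debug server:", err)
	}
}
```

Keeping the handlers on their own mux, rather than the default one, avoids exposing them through any other HTTP server the process runs.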
+Mon, Oct 21, 2024, 13:54:17 - georgezgeorgez: What's our action items?
+* I will get the sync Graphs created and integrated into our deployment process.
+* Once we have the modified code for hyperqube_z, we will need to write instructions and give a hardware recommendation.
+* Is there something we can do regarding leveldb right now?
+Mon, Oct 21, 2024, 13:54:41 - deeznnutz: I will work on finishing backup / restore locally
+Mon, Oct 21, 2024, 13:55:31 - georgezgeorgez: deeznnutz: we can do an ansible crash course in this chat
+Mon, Oct 21, 2024, 13:55:41 - georgezgeorgez: I'll walk you through some things
+Mon, Oct 21, 2024, 13:56:30 - vilkris: <@georgezgeorgez:hc1.chat "What's our action items?"> With regard to leveldb - I'm interested in testing whether explicitly cleaning up the resources has an effect on the sync. And if the database is replaced there has to be a mechanism to clean up database references since other databases explicitly require it
+Mon, Oct 21, 2024, 13:56:56 - vilkris: This means that quite a few places in the code have to be touched
+Mon, Oct 21, 2024, 13:57:36 - georgezgeorgez: Makes sense. I'm hoping once the incubator process is more mature and on-chain on hyperqube_z/hc1_betanet, that we can start getting forward incentives for this kind of work too.
+Mon, Oct 21, 2024, 13:57:53 - georgezgeorgez: I've asked 0x to start filling out an OP SIG WP1
+Mon, Oct 21, 2024, 13:58:18 - deeznnutz: Anything else before we close this meeting out?
+Mon, Oct 21, 2024, 13:58:27 - georgezgeorgez: Next meeting time
+Mon, Oct 21, 2024, 13:58:48 - georgezgeorgez: This should usually work for me
+it's possible i have a conflict but that's fine
+Mon, Oct 21, 2024, 13:58:51 - deeznnutz: 18 Nov at the same time?
+Mon, Oct 21, 2024, 13:59:05 - vilkris: Should work yes
+Mon, Oct 21, 2024, 13:59:20 - deeznnutz: Cool - thanks everyone!
+Mon, Oct 21, 2024, 13:59:37 - deeznnutz: === END OP SIG 21 Oct 2024 ===
