Poly-graph ML

Machine learning has significantly advanced protection against web3 threats by utilizing learned transactional patterns to predict attacks, leading to well-trained models adapted to Ethereum's intricate network structure (shown below).

The local neighborhood of a single address (center). The richness and complexity of the transaction network are well captured by a “graph” geometry: nodes (vertices) correspond to addresses and their features, and connecting lines correspond to transactions. Node colors indicate address type, e.g. smart contracts are shown in blue.

Accurate attack detection models provide significant time and resource savings to financial institutions through automation. Yet despite extensive academic research in machine learning focused on web3 security, the industry faces unique challenges that such academic research often overlooks, such as the need for efficient, real-time analysis. Additionally, there is a tradeoff between usability and accuracy: generic models prioritize broad applicability over precision, which leads to a large number of “false positives”.

As a first line of defense, the industry has relied on the construction of blacklists (BLs), which are a primitive form of model called a filter. The BL idea is simple: create a registry of threatening entities and check potentially dangerous actors against it. While simple, BLs have several drawbacks. For one, they fail to provide real-time protection, so they are always one step behind new threats. In addition, from the perspective of web3, conventional BL approaches have several intrinsic problems due to centralization. In a centralized threat registry like a BL, trust is placed in the aggregators, holders, and distributors of the data. If the labels and their sources are not meticulously validated, such centralized blacklists quickly become corrupted or outdated, and their usefulness becomes limited.

Instead of centralization with BLs, we propose a distributed community effort, with data producers, validators, and a consensus mechanism to confirm the predictions. These key components generate, in part, the crucial data labels used by the machine learning engine, the details of which we describe below.

The remainder of this section is outlined as follows. First, we present an open source utility, called feature-beast, which allows ML-inclined developers to construct new customized models, and discuss how these models function in general. We then give a quick overview of our provisioned models, which are benchmarked for accuracy and performance. Next, we elaborate on how the components outlined in the tech architecture fit into the ML prediction sector. Lastly, we discuss how nodes can contribute their prediction models and how zero-knowledge proofs are used to verify that contributions meet the required accuracy standards.

Shown are pairs of phishing (left and leftmost) and benign (right and rightmost) transaction networks. The “jellyfish” geometrical patterns indicative of phishing are learned by prediction models, such as graph neural networks, through a process called embedding, which uses mathematical representations of the historical transaction data.

Feature Beast: Open Source ML Toolkit

ML implementations for address classification come in many forms. In particular, graph neural networks (GNNs) excel in address classification due to their ability to learn geometric representations of hack, phishing, and money laundering patterns involving large sub-networks, instead of relying on “shallow” statistics involving one address and its senders and recipients. At the end of this section we also review a blueprint for future graph neural network implementation, as guided by the geometric deep learning program.

In general, to build a trained address classification model, such as a graph neural network, for generalized fraud prediction, one follows these steps:

  1. Data Preprocessing: Collect transaction data from the Ethereum blockchain, representing nodes as accounts and edges as transactions between them.

  2. Feature Engineering: Extract features such as transaction amounts, frequencies, and types, and encode them into node and edge attributes.

  3. Graph Construction: Build a graph structure from the preprocessed data, where nodes represent addresses and edges represent transactions.

  4. Model Design: Design an architecture tailored for classification tasks, in the case of GNNs incorporating message passing, convolutional, or attention layers to aggregate information from neighboring nodes and edges.

  5. Training: Train the model using labeled data, optimizing the parameters to minimize a suitable loss function, such as cross-entropy loss.

  6. Evaluation: Evaluate the model's performance using metrics like accuracy, precision, recall, and F1 score on a hold-out validation set.

  7. Prediction: Apply the trained model to new transaction data to predict the likelihood of fraud for each target account, identifying suspicious activities for alert broadcast candidates.
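As a rough illustration of steps 1 to 4, the following minimal sketch builds a graph from a handful of transactions and runs a single round of mean-aggregation message passing. The toy transactions, the single "total sent" node feature, and the function names are all assumptions made for illustration; a real GNN would use learned weights and a dedicated library rather than a plain average.

```python
from collections import defaultdict

# Hypothetical toy transactions: (sender, recipient, amount).
transactions = [
    ("0xA", "0xB", 1.0),
    ("0xA", "0xC", 3.0),
    ("0xB", "0xC", 2.0),
    ("0xC", "0xD", 0.5),
]

def build_graph(txns):
    """Steps 1-3: addresses become nodes, transactions become edges."""
    neighbors = defaultdict(set)
    sent_total = defaultdict(float)  # one illustrative node feature
    for sender, recipient, amount in txns:
        neighbors[sender].add(recipient)
        neighbors[recipient].add(sender)  # treat the graph as undirected
        sent_total[sender] += amount
        sent_total[recipient] += 0.0      # ensure recipients have a feature
    return dict(neighbors), dict(sent_total)

def message_passing_round(neighbors, features):
    """Step 4 (core idea only): average a node's feature with its neighbors'."""
    updated = {}
    for node, nbrs in neighbors.items():
        values = [features[node]] + [features[n] for n in nbrs]
        updated[node] = sum(values) / len(values)
    return updated

neighbors, features = build_graph(transactions)
embedded = message_passing_round(neighbors, features)
```

After one round, each address carries information about its direct counterparties; stacking more rounds lets information flow across larger sub-networks, which is the property the text attributes to GNNs.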

Once the architecture for the model is created, it can be refined and tuned to new data labels as they arise through a “training cycle”, which keeps the model up to date. A simple round of the training cycle is illustrated in the diagram below.

Because web3 attacks are so diverse, a single prediction model is ineffective; tailored models aligned with specific threats are needed. For example, a DEX subject to AML regulations has different requirements from a node provider concerned with Sybil attacks. Most applications can nevertheless leverage the common graph structure of transaction networks, albeit with varying input variables and training algorithms.

The Polyzoa PLZ-prediction-toolkit, centered around the universal "threat-graph" structure, is an accessible toolbox for crafting customized prediction models tailored to individual threats within web3 transaction networks. It allows ML developers to leverage their own labels, tech stacks, and hardware configurations to combat attacks in precise ways that would be impossible with a generic, broad-spectrum prediction code.

The PLZ-prediction-toolkit comprises the following essential components: an aggregator for efficiently harvesting transaction data, a groomer dedicated to cleaning and preparing the data with the labels, a sophisticated "feature beast" designed specifically for constructing intricate statistical features essential for modeling, and models used for predictions themselves.

Feature-beast: An open source feature construction tool

The feature-beast component of the toolkit facilitates model construction by aiding feature generation.

Polyzoa feature beast is a Python package that specializes in time series feature extraction: it extracts statistical features from transaction history to build custom model variables, or "features," tailored for web3 security use. It leverages blockchain data to characterize attacks and exploits within specific sectors, allowing for the creation of heuristic filters or trained models for phishing, ice-phishing, AML, etc.

Poly-graph: Decentralized Modeling

Poly-graph is Polyzoa’s decentralized implementation of machine learning based network security. It gives autonomy to participating nodes to develop and use specialized models.

Development of high quality models is incentivized by payment of tokens to contributors. Feature developers can contribute their features to a shared registry, the Polyzoa Feature Store, for evaluation. The features with the highest importance, as determined through their mathematical weights, may be incorporated into the provisioned models provided certain accuracy requirements are fulfilled (see below). As an incentive, the developers are rewarded with $PLZ tokens, the amount of which is outlined by the detailed tokenomics.

Poly-graph, the distributed network of prediction models

Deployable containers submitted to the community model registry, the Polyzoa Model Store, hold a distinct significance. Subject to content standardization, these models can be contributed for widespread adoption by validator nodes in exchange for rewards. Models meeting an accuracy requirement of 70% precision and 50% recall are eligible for adoption and subsequent rewards. Once adopted, broadcast models announce predictions as alerts for consensus evaluation, with validated predictions recorded into the "risk classification ledger," further expanding the Poly-graph chain constituents (see architecture section).
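The adoption rule above can be sketched as a simple check. The 70% precision and 50% recall thresholds come from the text; the submission names, their scores, and the choice of inclusive bounds are hypothetical, and the actual Model Store evaluation procedure is not described here.

```python
# Thresholds quoted in the text above.
MIN_PRECISION = 0.70
MIN_RECALL = 0.50

def is_eligible(precision, recall):
    """True when both accuracy requirements are met (inclusive bounds assumed)."""
    return precision >= MIN_PRECISION and recall >= MIN_RECALL

# Hypothetical submissions: model name -> (precision, recall) on a benchmark set.
submissions = {
    "phish-gnn": (0.82, 0.61),
    "sybil-rules": (0.91, 0.35),  # precise, but misses too many cases
    "aml-boost": (0.66, 0.74),    # catches many cases, but too many false alarms
}

eligible = sorted(
    name for name, (p, r) in submissions.items() if is_eligible(p, r)
)
```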

Provisioned Models

Polyzoa will offer a diverse range of pre-trained models, covering fundamental attack types, accessible to the community via a git repository. Additionally, Polyzoa will provide a stable release of benchmarked prediction models, known as gr-eth, specifically designed for out-of-the-box address classification, catering to immediate prediction requirements and validator roles. Provisioned models come in two basic forms: heuristic and trained.

Logic-Condition Models (Heuristic Filters)

Heuristic filters enable rapid prediction with minimal computational burden, functioning through logical conditions aimed at identifying the most probable candidates efficiently.

These filters serve as an initial screening mechanism to identify suspicious transactions for further investigation. Types of heuristic filters include:

  • Anomalous Volume: Flag transactions with unusually large or small amounts compared to the typical transaction volume for a particular account or network.

  • Frequency Thresholds: Identify accounts or transactions with abnormally high frequencies within a given time period, suggesting potential bot activity or spam.

  • Blacklists: Flag transactions involving known addresses associated with suspicious activities.

Additional heuristic filters can be designed using the plz-prediction-toolkit’s feature_beast.
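The three filter types listed above can be sketched as plain logical conditions. Everything concrete below is an assumption for illustration: the blacklist entry, the factor of 10 for "unusual" amounts, and the limit of 5 transactions per 60 seconds. None of these names belong to the feature_beast API.

```python
from statistics import mean

BLACKLIST = {"0xBAD"}  # hypothetical registry of known-bad addresses

def anomalous_volume(amount, history, factor=10.0):
    """Flag amounts far above or below the account's typical (mean) amount."""
    if not history:
        return False
    typical = mean(history)
    return amount > typical * factor or amount < typical / factor

def exceeds_frequency(timestamps, window_seconds=60, max_txns=5):
    """Flag if more than max_txns transactions fall inside any sliding window."""
    ts = sorted(timestamps)
    start = 0
    for end in range(len(ts)):
        while ts[end] - ts[start] >= window_seconds:
            start += 1
        if end - start + 1 > max_txns:
            return True
    return False

def on_blacklist(sender, recipient):
    """Flag transactions that touch a blacklisted address."""
    return sender in BLACKLIST or recipient in BLACKLIST

def screen(sender, recipient, amount, sender_history, sender_timestamps):
    """Return the names of all filters that fire for one transaction."""
    reasons = []
    if on_blacklist(sender, recipient):
        reasons.append("blacklist")
    if anomalous_volume(amount, sender_history):
        reasons.append("anomalous_volume")
    if exceeds_frequency(sender_timestamps):
        reasons.append("frequency")
    return reasons
```

A transaction with a non-empty result would be passed on for further investigation, for example by a trained model.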

Trained Numerical Models

Trained numerical models are those built using the 7 steps outlined above. We provide a trained phishing prediction model for Ethereum network addresses, the gr-eth classifier, deployable on any participating node to predict phishing and ice-phishing actors based on transaction history. Its features were constructed using labeled data from multiple sources and include:

X = {address, min sent, avg sent, min received, max received, txn cnt, unique received cnt, eff balance, amt received, time active, avg time bw sent txn, avg time bw received txn}.
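To make the feature vector concrete, here is a minimal sketch that derives several of the listed features from a toy transaction history. The record layout and the precise definitions (for example, counting distinct senders for "unique received cnt") are assumptions; the source does not define how gr-eth computes each feature.

```python
from statistics import mean

# Hypothetical toy records: (timestamp_seconds, sender, recipient, amount).
TXNS = [
    (0,   "0xA", "0xB", 1.0),
    (60,  "0xC", "0xA", 4.0),
    (180, "0xA", "0xD", 3.0),
    (240, "0xE", "0xA", 2.0),
]

def address_features(address, txns):
    """Compute a subset of the statistical features for one address."""
    sent = [(t, amt) for t, s, r, amt in txns if s == address]
    received = [(t, amt, s) for t, s, r, amt in txns if r == address]
    times = sorted(t for t, s, r, _ in txns if address in (s, r))
    sent_times = sorted(t for t, _ in sent)
    sent_gaps = [b - a for a, b in zip(sent_times, sent_times[1:])]
    return {
        "min_sent": min((amt for _, amt in sent), default=0.0),
        "avg_sent": mean(amt for _, amt in sent) if sent else 0.0,
        "min_received": min((amt for _, amt, _ in received), default=0.0),
        "max_received": max((amt for _, amt, _ in received), default=0.0),
        "txn_cnt": len(times),
        # Assumed meaning: number of distinct addresses received from.
        "unique_received_cnt": len({s for _, _, s in received}),
        "amt_received": sum(amt for _, amt, _ in received),
        "time_active": times[-1] - times[0] if times else 0,
        "avg_time_bw_sent_txn": mean(sent_gaps) if sent_gaps else 0.0,
    }

features = address_features("0xA", TXNS)
```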

The prediction algorithm employs extreme-gradient boosting and ensemble learning for both speed and accuracy, and is optimized through 5-fold cross-validation to minimize log-loss. Future iterations will feature a graph-neural-network predictor incorporating message passing and local aggregations between nodes. The associated training dataset will be released along with the model to facilitate benchmarking.
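The evaluation side of this procedure, 5-fold cross-validation scored by log-loss, can be sketched with the standard library alone. Note the loud assumption: the "model" below is a stand-in that simply predicts the positive rate of its training folds, not a gradient-boosted ensemble, and the labels are made up. Only the fold splitting and the log-loss formula are meant to carry over.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def k_fold_indices(n, k=5):
    """Yield (train, validation) index lists over k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

def cross_validated_log_loss(labels, fit_predict, k=5):
    """Average validation log-loss across the k folds."""
    losses = []
    for train, val in k_fold_indices(len(labels), k):
        probs = fit_predict(train, val)
        losses.append(log_loss([labels[i] for i in val], probs))
    return sum(losses) / len(losses)

# Hypothetical labels (1 = phishing) and a stand-in model, NOT gradient boosting.
labels = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]

def base_rate_model(train, val):
    rate = sum(labels[i] for i in train) / len(train)
    return [rate] * len(val)

score = cross_validated_log_loss(labels, base_rate_model)
```

In practice the hyperparameters of the real model would be chosen to make this cross-validated score as small as possible.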

Accuracy Criteria, Contributed Labels, and Distribution of Training Information

With the establishment of a distributed data set, or threat-graph, ML scientists and engineers can build prediction models that are in line with the philosophy of web3. As outlined in the consensus protocol sections, address scores are broadcast to the validators for confirmation before being inserted into the threat-graph database. To recap, when a model prediction is finished, a prediction result message is broadcast throughout the poly-graph network. Once the threat level has been verified by the consensus mechanism outlined in Sec.?, the participating Library nodes re-sync their registries and update their threat-graphs with the new information. Addresses with a validated prediction score exceeding a preset risk threshold of 0.65 are broadcast as a generalized alert.
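The final alerting rule can be sketched as a filter over validated scores. The 0.65 threshold is from the text; the function name, the address labels, and the highest-risk-first ordering are illustrative assumptions.

```python
ALERT_THRESHOLD = 0.65  # preset risk threshold quoted in the text

def select_alerts(validated_scores, threshold=ALERT_THRESHOLD):
    """Return addresses whose score strictly exceeds the threshold, riskiest first."""
    flagged = [(addr, s) for addr, s in validated_scores.items() if s > threshold]
    flagged.sort(key=lambda pair: pair[1], reverse=True)
    return [addr for addr, _ in flagged]

# Hypothetical validated scores: address -> risk score in [0, 1].
scores = {"0x1": 0.90, "0x2": 0.65, "0x3": 0.70, "0x4": 0.10}
alerts = select_alerts(scores)
```

Here "exceeds" is read as a strict inequality, so an address scoring exactly 0.65 is not broadcast.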

In addition, participants who want to run their own nodes, the so-called eagles, may set their own threshold based on their false positive rate requirements. The (sentry) nodes running the shared models, however, must conform to certain accuracy requirements to avoid missed cases and to minimize false positives.
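One way an eagle node might derive its own threshold from a false positive rate requirement is sketched below: given scores the model assigned to addresses known to be benign, pick the lowest threshold that keeps the share of benign addresses flagged within budget. The function name and the sample scores are hypothetical, and the source does not prescribe this procedure.

```python
def threshold_for_fpr(benign_scores, max_fpr):
    """Lowest observed score t such that the share of benign scores > t is <= max_fpr."""
    if not benign_scores:
        raise ValueError("need at least one benign score")
    n = len(benign_scores)
    for t in sorted(set(benign_scores)):
        false_positive_rate = sum(s > t for s in benign_scores) / n
        if false_positive_rate <= max_fpr:
            return t
    # Not reached for max_fpr >= 0: at the highest score nothing exceeds t.
    return max(benign_scores)

# Hypothetical model scores for ten addresses known to be benign.
benign = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
threshold = threshold_for_fpr(benign, max_fpr=0.2)
```

A stricter false positive budget pushes the threshold up, which in turn lowers recall; this is the tradeoff discussed under performance measures.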

Monitoring

In addition to the threat-graph being updated with the prediction, newly formed label information is tagged with metadata about the model, date, etc., and then added to a data lake by the Librarians (see Tech section), where it is kept to monitor the models for prediction skew. The monitoring helps counteract slippage between the training data and the actual predictions from new incoming data, and thus keeps the models, and therefore also the sentry nodes, up to date with the latest threats.
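Prediction skew can be quantified in several ways; the source does not say which statistic is used. As one common option, the sketch below computes the population stability index (PSI) between the score distribution seen at training time and the distribution of recent live predictions. The bin count, the smoothing constant, and the sample scores are assumptions.

```python
import math

def histogram(scores, bins=10):
    """Fraction of scores falling in each equal-width bin over [0, 1]."""
    counts = [0] * bins
    for s in scores:
        idx = min(int(s * bins), bins - 1)  # scores are assumed to lie in [0, 1]
        counts[idx] += 1
    total = len(scores)
    return [c / total for c in counts]

def psi(expected_scores, actual_scores, bins=10, eps=1e-6):
    """Population stability index: sum over bins of (a - e) * ln(a / e)."""
    expected = histogram(expected_scores, bins)
    actual = histogram(actual_scores, bins)
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) for empty bins
        total += (a - e) * math.log(a / e)
    return total

# Hypothetical score samples.
training_scores = [0.10, 0.20, 0.30, 0.40]
live_scores = [0.60, 0.70, 0.80, 0.90]

no_drift = psi(training_scores, training_scores)
drift = psi(training_scores, live_scores)
```

A value near zero indicates that live predictions resemble the training distribution; larger values suggest the model may need another training cycle.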

Appendix:

Terminology

Supervised Classification

Supervised classification uses labelled records which are seen during training. An example of supervised learning is a scam token classifier, which identifies characteristics associated with scam and benign tokens in order to categorize them. With sufficient labelled data, supervised classification algorithms can learn to predict a variety of behaviors including phishing/ice-phishing, malicious contract code, ransomware activity, scam token distribution, spam airdrops, etc.

Heuristic Filtering

Heuristic filtering uses simple rule-of-thumb logic evaluations, usually “yes” or “no”, to narrow down the pool of threat candidates to those most likely to attack. The design of the heuristics is part art and part science, as it takes a deft touch to obtain the best results. The filters are usually designed empirically, based on observed transaction patterns of malicious entities, or, in more sophisticated forms, using the training weights (feature importances) of pre-existing models. A feature importance assessment from a supervised learning model usually gives the most accurate heuristics, but often at the expense of interpretability.

Unsupervised Classification (Anomaly Detection)

The term anomaly detection in transaction networks refers to the problem of finding exceptional patterns in transaction traffic that do not conform to the expected normal behavior. These “nonconforming” patterns or values, often referred to as anomalies, outliers, exceptions, etc., arise either as deviations from the “mean behavior” in the time-series transaction history of single addresses or groups of addresses, or as extreme values of the entities themselves. Thus, to provide an appropriate solution in network anomaly detection, one needs a concept of “normality” and a concept of “distance” from normality.
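A minimal sketch of these two concepts, assuming the simplest possible choices: "normality" is the mean of an address's past transaction amounts, and "distance" is the z-score, i.e. the number of standard deviations from that mean. The cutoff of 3 and the sample amounts are illustrative; real transaction data is often heavy-tailed, where more robust statistics may be preferable.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Distance of each value from the mean, in units of standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)  # all values identical: nothing is anomalous
    return [(v - mu) / sigma for v in values]

def find_anomalies(values, threshold=3.0):
    """Indices of values lying more than `threshold` deviations from the mean."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > threshold]

# Hypothetical amounts: twenty ordinary transfers followed by one outlier.
amounts = [1.0] * 20 + [50.0]
anomalies = find_anomalies(amounts)
```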

Performance measures:

For generic prediction models, there is a tradeoff between precision, the ability to avoid false alarms, and recall, which penalizes the failure to catch positive cases. The best models strike a compromise between the two. To avoid a large number of false positive reports, a confidence threshold for classifying positive cases must be determined by the threat model on a case-by-case basis.
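The tradeoff can be seen by computing both measures at two different confidence thresholds. The labels and scores below are made up for illustration; the formulas are the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN), with F1 as their harmonic mean.

```python
def confusion_counts(y_true, scores, threshold):
    """Count true positives, false positives, and false negatives."""
    tp = fp = fn = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
    return tp, fp, fn

def precision_recall_f1(y_true, scores, threshold):
    """Return (precision, recall, F1); empty denominators yield 0.0."""
    tp, fp, fn = confusion_counts(y_true, scores, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels (1 = malicious) and model scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.1]

loose = precision_recall_f1(y_true, scores, threshold=0.5)
strict = precision_recall_f1(y_true, scores, threshold=0.85)
```

Raising the threshold from 0.5 to 0.85 removes the false alarm, so precision rises from 2/3 to 1, but two of the three malicious cases are now missed, so recall falls from 2/3 to 1/3.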
