
Artificial Intelligence on Algorand

Overview

Introduction: Data and Machine Learning

Advances in big data and data analytics give organizations powerful tools for meeting their objectives. Big data and advanced analytics play a key role in raising the productivity of knowledge-intensive tasks, maximizing the value of assets, and enabling personalized digital experiences.

Predictive process monitoring at runtime is growing especially quickly in importance. Predicting the remaining cycle time, compliance, the sequence of process activities, the final or partial outcome, or the prioritization of processes helps organizations make decisions and gain valuable insights in a rapidly evolving environment.

In this article we connect to the Algorand Indexer with Python to access and collect USDC transaction data. We also demonstrate how to predict transaction volume with machine learning.

Algorand as a shared database

A blockchain, at its core, is a type of shared database that differs from a typical database in the way it stores information. A blockchain is a growing list of records, called blocks, that are linked together using cryptography.

A blockchain is also an immutable ledger of transactions. The first block in the chain is called the genesis block. As each new block of transactions is recorded, it is linked to the previous one to form a chain of data records, a blockchain. As a result, a blockchain contains every transaction recorded since the ledger was started.

Accessing the Data

When a person or an application accesses the data or submits a transaction, it is in fact calling nodes on the network. The blockchain protocol guarantees that the data can be reconstructed correctly and reliably from the pieces of information received.

The entire blockchain is on the order of hundreds of gigabytes, so if you configure your software to store it locally, it will typically download a number of large files from other computers and save them to your disk. Retrieving a large amount of filtered and aggregated data directly from a node, however, could take days if not weeks.

An indexer lets you explore the blockchain from your local machine. The data comes directly from an Algorand node and can be accessed from your desktop through a third-party API.

For deep analysis such as market research we want to explore entire histories of addresses, calls, and traces, which requires a large amount of data. With an indexer we can efficiently serve an application filtered and aggregated data.

Algorand PureStake Indexer

The most convenient endpoint for querying Algorand is the PureStake Indexer, which is available on Algorand's TestNet, BetaNet, and MainNet.

We chose Python as the programming language since it is designed to get things done: its simple, uncluttered syntax and clean design make it easy to write code that just works. The Python library for interacting with the Algorand network is py-algorand-sdk.

After registering on purestake.com to obtain an API key, we store the key in an environment configuration file. Next we configure a header that contains the API key and choose the MainNet PureStake indexer endpoint. Finally, we initialize the Algorand IndexerClient. The result of the script should indicate a successful connection and report the health of the indexer.
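As a minimal sketch, the environment file could look like the following (the variable name API_KEY is an assumption chosen to match the script below; keep this file out of version control):

# .env
API_KEY=your-purestake-api-key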

Accessing the Indexer

To access the indexer we create a Python file, import the libraries, and load the API key from the environment.

import os
from dotenv import load_dotenv

from algosdk.v2client.indexer import IndexerClient

load_dotenv()

SECRET_KEY = os.getenv("API_KEY")

# PureStake expects the API key in the X-API-Key header
algod_header = {
    'User-Agent': 'Minimal-PyTeal-SDK-Demo/0.1',
    'X-API-Key': SECRET_KEY
}

# PureStake MainNet indexer endpoint
indexer_address = "https://mainnet-algorand.api.purestake.io/idx2"

# Initialize the Indexer client
algod_indexer = IndexerClient(
    SECRET_KEY,
    indexer_address,
    algod_header
)

# Print the indexer health status to confirm the connection
print(algod_indexer.health())

Querying transactions

We further extend the script above to query data for the USDC stablecoin. Predicting its market volume gives significantly more leeway for planning on-chain settlement.

USDC is available natively as Ethereum ERC-20, Algorand ASA, Solana SPL token, Stellar asset, TRON TRC-20, Hedera token, and Avalanche ERC-20.

On Algorand, the USDC asset_id is 31566704, which can be viewed on Algorand Explorer.
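As a quick sanity check, the asset parameters can be looked up through the indexer. The sketch below assumes the algod_indexer client created above and py-algorand-sdk's asset_info method:

# Look up the asset parameters for USDC (asset_id 31566704)
info = algod_indexer.asset_info(31566704)
print(info['asset']['params']['unit-name'])   # expected to print: USDC
print(info['asset']['params']['decimals'])    # expected to print: 6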

In the following script we add a get_data function that queries the indexer over a given time frame, defined by a starting and closing round. Inside the function we iterate over the rounds and call search_transactions to fetch transactions that meet the criteria of at least 1 unit of value within the round boundaries. The data is written as JSON and stored in the data folder.

import os
import json
from dotenv import load_dotenv

from algosdk.v2client.indexer import IndexerClient

load_dotenv()

SECRET_KEY = os.getenv("API_KEY")

algod_header = {
    'User-Agent': 'Minimal-PyTeal-SDK-Demo/0.1',
    'X-API-Key': SECRET_KEY
}

indexer_address = "https://mainnet-algorand.api.purestake.io/idx2"

algod_indexer = IndexerClient(
    SECRET_KEY,
    indexer_address,
    algod_header
)

def get_data(asset_id, min_round, end_round):

    # Make sure the output directory exists
    os.makedirs('data', exist_ok=True)

    reference_point = min_round
    stride = 10000
    while True:
        if end_round <= reference_point:
            break
        data = algod_indexer.search_transactions(
            min_round=reference_point,
            max_round=reference_point + stride,
            asset_id=asset_id,
            min_amount=1,
            limit=10000
        )

        len_slice = len(data['transactions'])

        # Only write a file when the slice actually contains transactions
        if len_slice > 0:
            file = json.dumps(data['transactions'])

            with open(f'data/json_data_{asset_id}_{reference_point}-{stride}.json', 'w') as outfile:
                outfile.write(file)

        reference_point += stride

min_round = 11611000
max_round = 16711000
usdc = 31566704
get_data(usdc, min_round, max_round)
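Note that the indexer may cap the number of results returned per call, so a single request will not necessarily contain every transaction in a 10,000-round window. A hedged sketch of following the response's next-token field with the next_page parameter of search_transactions (names as in py-algorand-sdk) could look like this:

def get_all_transactions(indexer, asset_id, min_round, max_round):
    """Collect every matching transaction by following next-token pagination."""
    transactions = []
    next_token = None
    while True:
        resp = indexer.search_transactions(
            asset_id=asset_id,
            min_round=min_round,
            max_round=max_round,
            min_amount=1,
            limit=1000,
            next_page=next_token
        )
        batch = resp.get('transactions', [])
        transactions.extend(batch)
        next_token = resp.get('next-token')
        # Stop when a page is empty or no continuation token is returned
        if not batch or not next_token:
            break
    return transactions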

Data Transformation and Visualisation

Next we transform the exported data and conduct exploratory data analysis.

The imported libraries:

import glob, os, json

import pandas as pd
import numpy as np

import plotly.express as px

pd.set_option('display.max_columns', 35)
pd.set_option('display.max_colwidth', None)


import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.neural_network import MLPRegressor
import sklearn.metrics as metrics

We load and concatenate the batches of exported data and convert the amounts from micro-units to whole units so they are human readable.

json_dir = os.getcwd() +'/data/'

json_pattern = os.path.join(json_dir, '*.json')
file_list = glob.glob(json_pattern)

dfs = []

for file in file_list:
    with open(file) as f:
        json_data = pd.json_normalize(json.loads(f.read()))
    dfs.append(json_data)

df = pd.concat(dfs)

# Fees and USDC amounts are stored in micro-units; divide by 1,000,000 to get whole units
df.fee = df.fee / 1000000
df['asset-transfer-transaction.amount'] = df['asset-transfer-transaction.amount'] / 1000000
# Convert the Unix timestamp to a datetime
df['round-time'] = pd.to_datetime(df['round-time'], unit='s')

We also group the data by day, sum the asset transfer amounts, and plot the result with Plotly.

ag_df = df.groupby(by=[df['round-time'].dt.date])['asset-transfer-transaction.amount'].agg(volume='sum', mean='mean')
fig = px.line(ag_df, y=ag_df.volume, x=ag_df.index)
fig.update_layout(template="plotly_dark")
fig.show()

Figure: daily USDC transfer volume on Algorand (Plotly line chart).

Feature Engineering and Prediction

We also add time-series based features and drop rows with missing values:

# Insert a new column with the previous day's volume
ag_df.loc[:,'volume-1'] = ag_df.loc[:,'volume'].shift()

# Insert another column with the day-over-day difference in volume
ag_df.loc[:,'volume_diff'] = ag_df.loc[:,'volume'].diff()
ag_df = ag_df.dropna()

We split the data into training and testing sets using an 80/20 split.

train = ag_df[:int(len(ag_df)*0.8)]
test = ag_df[int(len(ag_df)*0.8):]

X_train, X_test, y_train, y_test = train.drop('volume', axis=1), test.drop('volume', axis=1), train['volume'], test['volume']

We chose scikit-learn's built-in neural network (MLPRegressor) and a Random Forest model, and used GridSearchCV with a time-series split to tune each model and compare the best configurations.

model = RandomForestRegressor(random_state=42)
param_search = { 
    'n_estimators': [10, 20, 50, 100],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth' : [i for i in range(1,15)]
}
tscv = TimeSeriesSplit(n_splits=4)
gsearch = GridSearchCV(estimator=model, cv=tscv, param_grid=param_search, scoring = 'neg_mean_squared_error')
gsearch.fit(X_train, y_train)
rf_best_score = gsearch.best_score_
rf_best_model = gsearch.best_estimator_
print(f"{rf_best_model} at {rf_best_score}")


model = MLPRegressor(random_state=1)
param_search = { 
    'max_iter':[ 100, 200, 400, 600, 800, 1000],
    'solver': ['lbfgs', 'sgd', 'adam'],
    'activation':['identity', 'logistic', 'tanh', 'relu'],
}

tscv = TimeSeriesSplit(n_splits=4)
gsearch = GridSearchCV(estimator=model, cv=tscv, param_grid=param_search, scoring = 'neg_mean_squared_error')
gsearch.fit(X_train, y_train)
best_score = gsearch.best_score_
best_model = gsearch.best_estimator_
print(f"{best_model} at {best_score}")

Results

The scoring function:

def regression_results(y_true, y_pred):
    # Regression metrics
    explained_variance = metrics.explained_variance_score(y_true, y_pred)
    mean_absolute_error = metrics.mean_absolute_error(y_true, y_pred) 
    mse=metrics.mean_squared_error(y_true, y_pred) 
    mean_squared_log_error=metrics.mean_squared_log_error(y_true, y_pred)
    r2=metrics.r2_score(y_true, y_pred)

    print('explained_variance: ', round(explained_variance, 4))    
    print('mean_squared_log_error: ', round(mean_squared_log_error, 4))
    print('r2: ', round(r2, 4))
    print('MAE: ', round(mean_absolute_error, 4))
    print('MSE: ', round(mse, 4))
    print('RMSE: ', round(np.sqrt(mse), 4))
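As a sketch of how the function is applied (assuming the rf_best_model and best_model variables from the grid-search cells above), the held-out test set can be scored like this:

# Score both tuned models on the held-out test set
print("Random Forest:")
regression_results(y_test, rf_best_model.predict(X_test))

print("MLP:")
regression_results(y_test, best_model.predict(X_test))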

This yielded the following results:

Figure: regression metrics for both models (screenshot of the printed output).

Conclusion

The scikit-learn neural network significantly outperformed the Random Forest model out of sample: its R-squared is a perfect 1 and its mean squared error is close to zero.

This means we can use an out-of-the-box scikit-learn neural network with minor parameter tuning to reliably forecast daily USDC transaction volume.

One idea is to pair the indexer with a fast machine learning model as a supplement to an on-chain oracle. A smart contract that consumes the prediction could further aid its users and the surrounding infrastructure.