I am not an expert in SLURM or HPC usage, but I have used them for a while now and have found SLURM to be a very useful tool for managing HPC jobs. In this post, I will give a brief introduction to SLURM, DelftBlue, and INSY, along with some basic information on how to use SLURM to manage HPC jobs on the DelftBlue and INSY clusters.
I have noticed that some students — particularly those with little to no technical background — are having difficulty using SLURM to manage HPC workloads on the DelftBlue and INSY clusters. This observation was the impetus for this post. I wanted to write a brief tutorial for those students that would cover all the essential information in one location. This post aims to assist you in getting started with SLURM and HPC tasks on the DelftBlue and INSY clusters. Examples provided here are mainly for GPU-based jobs, but similar principles apply to CPU-based jobs as well.
Still, I would strongly suggest reading the official documentation of DelftBlue and INSY, as it is well written and contains a lot of useful information.
SLURM (Simple Linux Utility for Resource Management) is a powerful open-source cluster management and job scheduling system that is widely used in High Performance Computing (HPC) environments. It is designed to be highly scalable, fault-tolerant, and easy to use.
To submit a GPU job to the SLURM scheduler, you will need to use the `sbatch` command. It submits a batch script to the scheduler, which will then execute the script on the appropriate resources. Here is an example of a simple SLURM batch script that requests one GPU and runs a command:
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --time=00:10:00
# Execute the command
./your_command
In this example, the `#SBATCH` directives request one GPU and one node, and the job will run for at most 10 minutes. You can edit the script and include your own commands for the job.
To submit the job, use the `sbatch` command followed by the name of the batch script file:
sbatch my_job.sh
Once the job is submitted, you can use the `squeue` command to view its status. This command displays information such as the job ID, the user who submitted the job, the job state, and more.
squeue -u <username>
To cancel a job, use the `scancel` command followed by the job ID.
scancel <job_id>
After your job completes, you can use the `sacct` command to view accounting information about it, including the resources it consumed and its exit status.
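For example, to summarize a finished job:
sacct -j <job_id> --format=JobID,JobName,Elapsed,State,MaxRSS,ExitCode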
These are the basic steps for using SLURM to manage GPU-based HPC jobs. Be sure to consult the SLURM documentation for more information on how to use the system, including advanced configuration options and troubleshooting tips.
DelftBlue is a high-performance computing cluster used for research and education at TU Delft. It is a heterogeneous cluster consisting of a mix of CPU and GPU nodes. The official documentation is maintained at this link.
If you are a supervisor whose student needs to use DHPC, you can request a project for your student. The request form is here and must be filled in by the student.
I am too lazy to learn GUI-based software, so I use the terminal and give command examples here.
To connect to DelftBlue, you will need to use SSH. The login node is `login.delftblue.tudelft.nl`. You can connect to it using the following command:
ssh <netid>@login.delftblue.tudelft.nl
With the `scp` command, you can copy files to and from DelftBlue. Here are some examples of using the `scp` command:
scp <source> <target>
# Copying files from local machine to DelftBlue
scp <source> <netid>@login.delftblue.tudelft.nl:<target>
# Copying files from local machine to DelftBlue recursively
scp -r <source> <netid>@login.delftblue.tudelft.nl:<target>
# Copying files from DelftBlue to local machine
scp <netid>@login.delftblue.tudelft.nl:<source> <target>
# Copying files from DelftBlue to local machine recursively
scp -r <netid>@login.delftblue.tudelft.nl:<source> <target>
With the `sftp` command, you can transfer files to and from DelftBlue interactively. Here are some examples of using the `sftp` command:
sftp <netid>@login.delftblue.tudelft.nl
# Changing directory in DelftBlue
cd <directory>
# Changing directory on the local machine
lcd <directory>
# Listing files in DelftBlue
ls
# Listing files in local machine
lls
# Just add an 'l' to the beginning of the command to perform the same operation on the local machine
# Copying files from local machine to DelftBlue
put <source> <target>
# Copying files from local machine to DelftBlue recursively
put -r <source> <target>
# Copying files from DelftBlue to local machine
get <source> <target>
# Copying files from DelftBlue to local machine recursively
get -r <source> <target>
Modules are a way to manage software on a cluster. They allow you to load and unload software packages and to manage dependencies between them. Modules are managed using the `module` command. Here are some examples:
# Loading a module
module load <module_name>
# Unloading a module
module unload <module_name>
# Listing loaded modules
module list
# Listing available modules
module avail
The modules available on DelftBlue are listed on the modules page. Use the `module spider` command to search for modules:
module spider <module_name>
If you are not sure what you need, you can start with the following commands:
module load 2022r2 # load the default DelftBlue software stack
module load cuda/11.6 # or the CUDA version you need
module load miniconda3 # load conda
Check the installed CUDA version:
[<netid>@login04 ~]$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
Conda is an open-source package management system and environment management system that runs on Windows, macOS, and Linux. Conda quickly installs, runs, and updates packages and their dependencies. Conda easily creates, saves, loads, and switches between environments on your local computer. It is mainly used for Python programs.
To use conda, you will need to load the conda module. Here are some examples of using the `conda` command:
# Loading the conda module
module load miniconda3
# Creating a conda environment
conda create -n <environment_name> <package_name>
# Activating a conda environment
conda activate <environment_name>
# Deactivating a conda environment
conda deactivate
# Listing conda environments
conda env list
# Listing packages in a conda environment
conda list
# Installing a package in a conda environment
conda install <package_name> -c <channel_name>
# Removing a package from a conda environment
conda remove <package_name>
# Removing a conda environment
conda env remove -n <environment_name>
Should you use conda environments on DelftBlue? Yes. Is that all you need to do? No: by default, conda environments are stored in your home directory, which has a limited quota, so you will run into storage issues, and it will happen very quickly, believe me.
To avoid storage issues, create your conda directory on the scratch storage and link to it from your home directory.
mkdir -p /scratch/${USER}/.conda
ln -s /scratch/${USER}/.conda $HOME/.conda
Along the same lines, you can also create the cache and local folders on the scratch storage and link to them from your home directory. This may also help you avoid storage issues related to `pip`.
mkdir -p /scratch/${USER}/.cache
ln -s /scratch/${USER}/.cache $HOME/.cache
mkdir -p /scratch/${USER}/.local
ln -s /scratch/${USER}/.local $HOME/.local
#!/bin/sh
# You can control the resources and scheduling with '#SBATCH' settings
# (see 'man sbatch' for more information on setting these parameters)
#SBATCH --job-name="CasMVS" # job name
#SBATCH --partition=gpu # partition name; 'gpu' requests the GPU partition
#SBATCH --time=02:00:00 # time limit (HH:MM:SS)
#SBATCH --ntasks=1 # number of parallel tasks per job is 1
#SBATCH --cpus-per-task=2 # number of cores per task
#SBATCH --gpus-per-task=1 # number of GPUs per task
#SBATCH --mem-per-cpu=1G # memory per CPU core
#SBATCH --account=research-abe-ur # account name
# Measure GPU usage of your job (initialization)
previous=$(nvidia-smi --query-accounted-apps='gpu_utilization,mem_utilization,max_memory_usage,time' --format='csv' | /usr/bin/tail -n '+2')
# Use this simple command to check that your sbatch settings are working (it should show the GPU that you requested)
nvidia-smi
# Your job commands go below here
#module load 2022r2
#module load cuda/11.6
srun python train.py --dataset_name dtu --root_dir /scratch/<netid>/DTU/dtu/ --num_epochs 16 --batch_size 2 --depth_interval 2.65 --n_depths 8 32 48 --interval_ratios 1.0 2.0 4.0 --optimizer adam --lr 1e-3 --lr_scheduler cosine --exp_name dtu_cas_group_8 --num_groups 8 --num_gpus 1 > test.log
# Your job commands go above here
# Measure GPU usage of your job (result)
nvidia-smi --query-accounted-apps='gpu_utilization,mem_utilization,max_memory_usage,time' --format='csv' | /usr/bin/grep -v -F "$previous"
Some parts are stolen from Zexin’s internal documentation.
I will keep this section short and simple and will not go into details: I will just give you the commands, and you can figure out the rest. INSY is only meant for PhD students, postdocs, and employees. If you are not one of them, you should not be here.
Some parts are similar to DelftBlue, so I will not repeat them here.
Please have a look at the INSY page and links underneath for general Cluster information and general tutorial.
ssh <netid>@linux-bastion.tudelft.nl # TU Delft bastions
ssh <netid>@login1.hpc.tudelft.nl # INSY login node
If you don’t have a personal project directory, request data storage through this form.
Requests exceeding 5TB will be forwarded to FIM and will be harder to get granted. After approval, the requested project folder will be created at '/tudelft.net/staff-umbrella/<pname>', where <pname> is the folder name you provided when making the storage request.
The preferred solution is to use FileZilla. You can download it from here. If you are as lazy as me, you can use the `scp`, `sftp`, or `rsync` commands.
scp -r <local_path> <netid>@linux-bastion.tudelft.nl:<remote_path>
sftp <netid>@linux-bastion.tudelft.nl
put <local_path> <remote_path>
rsync -av --no-perms <local_path> <remote_path> # --no-perms is important
# check the commands in DelftBlue section for more details
Assuming you are in your home directory, you can install conda by running the following commands:
wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ./miniconda3/bin/activate
conda create -n <environment_name> python=3.8
conda activate <environment_name>
# for installing packages you can refer to DelftBlue section
module use /opt/insy/modulefiles
module avail
module whatis <module_name> (e.g. module whatis cuda)
module load <module_name> (e.g. module load cuda/11.2)
module list
For more information, please refer to INSY page and check the DelftBlue section for similar commands.
#SBATCH --gres=gpu # First available GPU
#SBATCH --gres=gpu:2 # Two GPUs for the same job
#SBATCH --gres=gpu:pascal:2 # Two GPUs of type pascal
Check the table below for the available GPU options.
Type | GPU | Architecture | Capability | Memory | Cores |
---|---|---|---|---|---|
p100 | Nvidia Tesla P100 | Pascal | 6.0 | 16 GB | 3584 |
pascal | Nvidia GeForce GTX1080 Ti | Pascal | 6.1 | 11 GB | 3584 |
turing | Nvidia GeForce RTX2080 Ti | Turing | 7.5 | 11 GB | 4352 |
v100 | Nvidia Tesla V100 | Volta | 7.0 | 16-32 GB | 5120 |
a40 | Nvidia A40 | Ampere | 8.6 | 48 GB | 10752 |
#!/bin/sh
# You can control the resources and scheduling with '#SBATCH' settings
# (see 'man sbatch' for more information on setting these parameters)
# The default partition is the 'general' partition
#SBATCH --partition=general
# The default Quality of Service is the 'short' QoS (maximum run time: 4 hours)
#SBATCH --qos=short
# The default run (wall-clock) time is 1 minute
#SBATCH --time=0:01:00
# The default number of parallel tasks per job is 1
#SBATCH --ntasks=1
# The default number of CPUs per task is 1 (note: CPUs are always allocated to jobs per 2)
# Request 1 CPU per active thread of your program (assume 1 unless you specifically set this)
#SBATCH --cpus-per-task=2
# The default memory per node is 1024 megabytes (1GB) (for multiple tasks, specify --mem-per-cpu instead)
#SBATCH --mem=1024
# Request a GPU
#SBATCH --gres=gpu:1
# Set mail type to 'END' to receive a mail when the job finishes
# Do not enable mails when submitting large numbers (>20) of jobs at once
#SBATCH --mail-type=END
# Measure GPU usage of your job (initialization)
previous=$(/usr/bin/nvidia-smi --query-accounted-apps='gpu_utilization,mem_utilization,max_memory_usage,time' --format='csv' | /usr/bin/tail -n '+2')
# Use this simple command to check that your sbatch settings are working (it should show the GPU that you requested)
/usr/bin/nvidia-smi
# Your job commands go below here
# Uncomment these lines and adapt them to load the software that your job requires
#module use /opt/insy/modulefiles
#module load cuda/11.2 cudnn/11.2-8.1.1.33
# Computations should be started with 'srun'. For example:
#srun python my_program.py
# Your job commands go above here
# Measure GPU usage of your job (result)
/usr/bin/nvidia-smi --query-accounted-apps='gpu_utilization,mem_utilization,max_memory_usage,time' --format='csv' | /usr/bin/grep -v -F "$previous"
I will try to keep this tutorial up to date and actively maintain it.
If you find any errors or have any suggestions, please feel free to open an issue or pull request in git.
We are interested in a class of functions $\Phi$ that satisfy equations of the form \begin{equation} F \left ( \mathbf{x}, \Phi, \nabla_\mathbf{x} \Phi, \nabla^2_\mathbf{x} \Phi, \ldots \right) = 0, \quad \Phi: \mathbf{x} \mapsto \Phi(\mathbf{x}). \label{eqn:functional} \end{equation}
Here we will discuss and focus on a simple implementation of SIRENs, which propose to leverage periodic activation functions for implicit neural representations. This work demonstrates that these networks, dubbed sinusoidal representation networks or SIRENs, are ideally suited for representing complex natural signals and their derivatives. During the learning process, I went through the following implementations (1, 2, 3) to understand SIRENs better (and reuse parts of them) and to come up with a simple test implementation.
The authors also show that SIRENs can be leveraged to solve challenging boundary value problems, such as particular Eikonal equations (yielding signed distance functions), the Poisson equation, and the Helmholtz and wave equations. Lastly, they combine SIRENs with hypernetworks to learn priors over the space of SIREN functions.
Most of these recent network representations build on ReLU-based multilayer perceptrons (MLPs). While promising, these architectures lack the capacity to represent fine details in the underlying signals, and they typically do not represent the derivatives of a target signal well. This is partly due to the fact that ReLU networks are piecewise linear, their second derivative is zero everywhere, and they are thus incapable of modeling information contained in higher-order derivatives of natural signals. While alternative activations, such as tanh or softplus, are capable of representing higher-order derivatives, we demonstrate that their derivatives are often not well behaved and also fail to represent fine details.
The main highlights of this work are the periodic activation function and the principled initialization scheme, both described below.
SIREN proposes a simple neural network architecture for implicit neural representations that uses the sine as a periodic activation function:
\[\begin{equation} \Phi \left( \mathbf{x} \right) = \mathbf{W}_n \left( \phi_{n-1} \circ \phi_{n-2} \circ \ldots \circ \phi_0 \right) \left( \mathbf{x} \right) + \mathbf{b}_n, \quad \mathbf{x}_i \mapsto \phi_i \left( \mathbf{x}_i \right) = \sin \left( \mathbf{W}_i \mathbf{x}_i + \mathbf{b}_i \right). \end{equation}\]

Here, $\phi_i: \mathbb{R}^{M_i} \mapsto \mathbb{R}^{N_i}$ is the $i^{th}$ layer of the network. It consists of the affine transform defined by the weight matrix $\mathbf{W}_i \in \mathbb{R}^{N_i \times M_i}$ and the biases $\mathbf{b}_i\in \mathbb{R}^{N_i}$ applied on the input $\mathbf{x}_i\in\mathbb{R}^{M_i}$, followed by the sine nonlinearity applied to each component of the resulting vector.
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    def __init__(
        self,
        in_features,
        out_features,
        bias=True,
        first_layer=False,
        omega=30,
        custom_init=None,
    ):
        super().__init__()
        self.omega = omega
        self.linear = nn.Linear(in_features, out_features, bias=bias)
        if custom_init is None:
            paper_init(self.linear.weight, first_layer=first_layer, omega=omega)
        else:
            custom_init(self.linear.weight)

    def forward(self, x):
        return torch.sin(self.omega * self.linear(x))  # sin(omega * (Wx + b))
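As a quick smoke test (a hypothetical snippet, assuming `paper_init` from the next section is in scope), a single layer can be probed like this; the sine keeps every output within $[-1, 1]$:

```python
layer = SineLayer(2, 256, first_layer=True, omega=30)
coords = 2 * torch.rand(8, 2) - 1  # 8 random points in [-1, 1]^2
out = layer(coords)
print(out.shape)                   # torch.Size([8, 256])
print(out.abs().max() <= 1)        # tensor(True): sine output is bounded
```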
The key idea in the initialization scheme is to preserve the distribution of activations through the network so that the final output at initialization does not depend on the number of layers. Note that building SIRENs with carelessly chosen uniformly distributed weights yields poor performance both in accuracy and in convergence speed.
Feel free to skip the following paragraph if you are not interested in the technical details; refer to the paper and its supplementary material for more mathematical and empirical insights.
Consider the output distribution of a single sine neuron with the uniformly distributed input $x \sim \mathcal{U}(-1, 1)$. The neuron’s output is $y = \sin(ax + b)$ with $a,b \in\mathbb{R}$. In the supplementary material, the paper shows that for any $a>\frac{\pi}{2}$, i.e. spanning at least half a period, the output of the sine is $y\sim\text{arcsine}(-1,1)$, a special case of a U-shaped Beta distribution, independent of the choice of $b$. Taking the linear combination of $n$ inputs $\mathbf{x}\in\mathbb{R}^n$ weighted by $\mathbf{w}\in\mathbb{R}^n$, the output is $y=\sin(\mathbf{w}^T\mathbf{x} + b)$. Assuming this neuron is in the second layer, each of its inputs is arcsine distributed. When each component of $\mathbf{w}$ is uniformly distributed such that $w_i \sim \mathcal{U}(-c/{\sqrt{n}}, c/{\sqrt{n}}), c\in\mathbb{R}$, the authors show (see the supplement) that the dot product converges to the normal distribution $\mathbf{w}^T\mathbf{x} \sim \mathcal{N}(0, c^2/6)$ as $n$ grows. Feeding this normally distributed dot product through another sine again yields an arcsine-distributed output for any $c>\sqrt{6}$.
The paper proposes to draw weights so that $w_i \sim \mathcal{U}(-\sqrt{6/n}, \sqrt{6/n})$, i.e. $c=\sqrt{6}$ in the parametrization above. This ensures that the input to each sine activation is normally distributed with a standard deviation of $1$. Since only a few weights have a magnitude larger than $\pi$, the frequency throughout the sine network grows only slowly. Based on their experiments, the authors suggest initializing the first layer of the sine network with weights such that the sine function $\sin(\omega_0\cdot\mathbf{W}\mathbf{x} + \mathbf{b})$ spans multiple periods over $[-1,1]$. Experimental results show $\omega_0=30$ to work well for all the applications in the paper.
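Before looking at the initialization code below, here is a quick Monte Carlo check (hypothetical, not from the paper's code) of the claim: with arcsine-distributed inputs and weights $w_i \sim \mathcal{U}(-\sqrt{6/n}, \sqrt{6/n})$, the variance of $\mathbf{w}^T\mathbf{x}$ should be $c^2/6 = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 256, np.sqrt(6)

# Arcsine-distributed inputs: the sine of a uniform angle over a full period.
x = np.sin(rng.uniform(0, 2 * np.pi, size=(20_000, n)))
w = rng.uniform(-c / np.sqrt(n), c / np.sqrt(n), size=(20_000, n))

dots = np.einsum("ij,ij->i", w, x)  # 20,000 samples of w^T x
print(dots.var())                   # close to c**2 / 6 = 1.0
```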
import numpy as np
import torch

def paper_init(weight, first_layer=False, omega=1):
    in_features = weight.shape[1]  # fan-in of the layer
    with torch.no_grad():
        if first_layer:
            # first layer: U(-1/in_features, 1/in_features)
            bound = 1 / in_features
        else:
            # remaining layers: U(-sqrt(6/in_features)/omega, sqrt(6/in_features)/omega)
            bound = np.sqrt(6 / in_features) / omega
        weight.uniform_(-bound, bound)
Consider the case of finding the function $\Phi:\mathbb{R}^2 \mapsto \mathbb{R}^1, \mathbf{x} \to \Phi(\mathbf{x})$ that parameterizes a given discrete intensity image $f$ in a continuous fashion. The image defines a dataset \(\mathcal{D}=\{(\mathbf{x}_{i}, f(\mathbf{x}_i))\}_i\) of pixel coordinates \(\mathbf{x}_i=(x_i,y_i)\) associated with their grayscale intensity \(f(\mathbf{x}_i)\). The only constraint $\mathcal{C}$ enforces is that $\Phi$ shall output image intensity at pixel coordinates, solely depending on $\Phi$ (none of its derivatives) and \(f(\mathbf{x}_i)\), with the form \(\mathcal{C}(f(\mathbf{x}_i),\Phi(\mathbf{x}))=\Phi(\mathbf{x}_i) - f(\mathbf{x}_i)\) which can be translated into the loss $\tilde{\mathcal{L}} = \sum_{i} \vert \Phi(\mathbf{x}_i) - f(\mathbf{x}_i)\vert^2$.
Below you can see the experiment where we supervise only on the image values; we also visualize the gradients $\nabla f$ and the Laplacian $\Delta f$.
The network code is simple: it takes two-dimensional coordinate features as input and tries to estimate a one-dimensional intensity value. It is basically a stack of the aforementioned sine layers.
class ImageSiren(nn.Module):
    def __init__(
        self,
        hidden_features,
        hidden_layers=1,
        first_omega=30,
        hidden_omega=30,
        custom_init=None,
    ):
        super().__init__()
        in_features = 2
        out_features = 1
        net = []
        net.append(
            SineLayer(
                in_features,
                hidden_features,
                first_layer=True,
                custom_init=custom_init,
                omega=first_omega,
            )
        )
        for _ in range(hidden_layers):
            net.append(
                SineLayer(
                    hidden_features,
                    hidden_features,
                    first_layer=False,
                    custom_init=custom_init,
                    omega=hidden_omega,
                )
            )
        final_linear = nn.Linear(hidden_features, out_features)
        if custom_init is None:
            paper_init(final_linear.weight, first_layer=False, omega=hidden_omega)
        else:
            custom_init(final_linear.weight)
        net.append(final_linear)
        self.net = nn.Sequential(*net)

    def forward(self, x):
        return self.net(x)
The code below generates all image coordinates using np.meshgrid and np.stack. The SciPy functions sobel and laplace are used to get the first- and second-order derivatives of the image. For simplicity, we assume that the image we are trying to regress is square.
import numpy as np
from scipy.ndimage import laplace, sobel
from torch.utils.data import Dataset

def generate_coordinates(n):
    rows, cols = np.meshgrid(range(n), range(n), indexing="ij")
    coords_abs = np.stack([rows.ravel(), cols.ravel()], axis=-1)
    return coords_abs

class PixelDataset(Dataset):
    def __init__(self, img):
        if not (img.ndim == 2 and img.shape[0] == img.shape[1]):
            raise ValueError("Only 2D square images are supported.")
        self.img = img
        self.size = img.shape[0]
        self.coords_abs = generate_coordinates(self.size)
        self.grad = np.stack([sobel(img, axis=0), sobel(img, axis=1)], axis=-1)
        self.grad_norm = np.linalg.norm(self.grad, axis=-1)
        self.laplace = laplace(img)

    def __len__(self):
        return self.size ** 2

    def __getitem__(self, idx):
        coords_abs = self.coords_abs[idx]
        r, c = coords_abs
        coords = 2 * ((coords_abs / self.size) - 0.5)  # normalize to (-1, 1)
        return {
            "coords": coords,
            "coords_abs": coords_abs,
            "intensity": self.img[r, c],
            "grad_norm": self.grad_norm[r, c],
            "grad": self.grad[r, c],
            "laplace": self.laplace[r, c],
        }
The torch.autograd.grad function is used to get the gradient of the network output with respect to the input coordinates. The Laplacian is calculated similarly with torch.autograd.grad, as the divergence of the gradient.
import torch

class GradientUtils:
    @staticmethod
    def gradient(target, coords):
        return torch.autograd.grad(
            target, coords, grad_outputs=torch.ones_like(target), create_graph=True
        )[0]

    @staticmethod
    def divergence(grad, coords):
        div = 0.0
        for i in range(coords.shape[1]):
            div += torch.autograd.grad(
                grad[..., i], coords, torch.ones_like(grad[..., i]), create_graph=True,
            )[0][..., i : i + 1]
        return div

    @staticmethod
    def laplace(target, coords):
        grad = GradientUtils.gradient(target, coords)
        return GradientUtils.divergence(grad, coords)
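A small sanity check (hypothetical) of these utilities on a function with known derivatives, $f(x, y) = x^2 + y^2$, whose gradient is $(2x, 2y)$ and whose Laplacian is the constant $4$:

```python
coords = torch.rand(5, 2, requires_grad=True)
target = (coords ** 2).sum(dim=1, keepdim=True)

print(torch.allclose(GradientUtils.gradient(target, coords), 2 * coords))  # True
print(GradientUtils.laplace(target, coords))  # ~4 for every point
```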
import matplotlib.pyplot as plt
import numpy as np
import torch
from torch.nn import Linear, ReLU, Sequential
from torch.utils.data import DataLoader
import tqdm
from dataset import PixelDataset
from net import GradientUtils, ImageSiren
img_ = plt.imread("facade.png")
img = 2 * (img_ - 0.5)  # rescale data to the (-1, +1) range
downsampling_factor = 8
img = img[::downsampling_factor, ::downsampling_factor] # reducing image resolution by skipping pixel rows and cols
size = img.shape[0]
dataset = PixelDataset(img)
We have two architectures to try out: SIRENs and the most prevalent MLP+ReLU architecture. We also have multiple options to guide the loss function: you can optimize over the intensity values, the gradients, or the Laplacian values of the image.
n_epochs = 301
batch_size = int(size ** 2)
logging_freq = 20
model_name = "siren" # "siren", "mlp_relu"
hidden_features = 256
hidden_layers = 3
target = "intensity" # "intensity", "grad", "laplace"
Here we create our model and choose Adam as the optimizer.
if model_name == "siren":
    model = ImageSiren(
        hidden_features,
        hidden_layers=hidden_layers,
        hidden_omega=30,
    )
elif model_name == "mlp_relu":
    layers = [Linear(2, hidden_features), ReLU()]
    for _ in range(hidden_layers):
        layers.append(Linear(hidden_features, hidden_features))
        layers.append(ReLU())
    layers.append(Linear(hidden_features, 1))
    model = Sequential(*layers)
    for module in model.modules():
        if not isinstance(module, Linear):
            continue
        torch.nn.init.xavier_normal_(module.weight)
else:
    raise ValueError("Unsupported model")

dataloader = DataLoader(dataset, batch_size=batch_size)
optim = torch.optim.Adam(lr=1e-4, params=model.parameters())
Below you can see our training loop. As you can see, MSE is used as the loss criterion.
for e in range(n_epochs):
    losses = []
    for d_batch in tqdm.tqdm(dataloader):
        x_batch = d_batch["coords"].to(torch.float32)
        x_batch.requires_grad = True
        y_true_batch = d_batch["intensity"].to(torch.float32)
        y_true_batch = y_true_batch[:, None]

        y_pred_batch = model(x_batch)

        if target == "intensity":
            loss = ((y_true_batch - y_pred_batch) ** 2).mean()
        elif target == "grad":
            y_pred_g_batch = GradientUtils.gradient(y_pred_batch, x_batch)
            y_true_g_batch = d_batch["grad"].to(torch.float32)
            loss = ((y_true_g_batch - y_pred_g_batch) ** 2).mean()
        elif target == "laplace":
            y_pred_l_batch = GradientUtils.laplace(y_pred_batch, x_batch)
            y_true_l_batch = d_batch["laplace"].to(torch.float32)[:, None]
            loss = ((y_true_l_batch - y_pred_l_batch) ** 2).mean()
        else:
            raise ValueError("Unrecognized target")

        losses.append(loss.item())
        optim.zero_grad()
        loss.backward()
        optim.step()

    print(e, np.mean(losses))

    if e % logging_freq == 0:
        pred_img = np.zeros_like(img)
        pred_img_grad_norm = np.zeros_like(img)
        pred_img_laplace = np.zeros_like(img)
        orig_img = np.zeros_like(img)
        for d_batch in tqdm.tqdm(dataloader):
            coords = d_batch["coords"].to(torch.float32)
            coords.requires_grad = True
            coords_abs = d_batch["coords_abs"].numpy()

            pred = model(coords)
            pred_n = pred.detach().numpy().squeeze()
            pred_g = (
                GradientUtils.gradient(pred, coords)
                .norm(dim=-1)
                .detach()
                .numpy()
                .squeeze()
            )
            pred_l = GradientUtils.laplace(pred, coords).detach().numpy().squeeze()

            pred_img[coords_abs[:, 0], coords_abs[:, 1]] = pred_n
            pred_img_grad_norm[coords_abs[:, 0], coords_abs[:, 1]] = pred_g
            pred_img_laplace[coords_abs[:, 0], coords_abs[:, 1]] = pred_l

        fig, axs = plt.subplots(3, 2, constrained_layout=True)
        axs[0, 0].imshow(dataset.img, cmap="gray")
        axs[0, 1].imshow(pred_img, cmap="gray")
        axs[1, 0].imshow(dataset.grad_norm, cmap="gray")
        axs[1, 1].imshow(pred_img_grad_norm, cmap="gray")
        axs[2, 0].imshow(dataset.laplace, cmap="gray")
        axs[2, 1].imshow(pred_img_laplace, cmap="gray")
        for row in axs:
            for ax in row:
                ax.set_axis_off()
        fig.suptitle(f"Iteration: {e}")
        axs[0, 0].set_title("Ground truth")
        axs[0, 1].set_title("Prediction")
        plt.savefig(f"visualization/{e}.png")
Below you can see the results for the different loss guidance options (left to right: intensity, gradient, and Laplace supervision).
SIREN vs. MLP+ReLU trained on intensity values (left: SIREN, right: MLP+ReLU). As you can see, the MLP needs more iterations to improve the intensity image, and its Laplacian stays blank, since this type of architecture tends to reconstruct smooth data.
You can access the source code in this GitHub repo.
[1] Official SIREN project page
[2] lucidrain implementation
[3] Jan Krepl implementation
We wanted to give a small gift to our alumni who left us during the pandemic and to those who would soon leave us. The gift is a wooden card with an artistic interpretation of our office on the front and a small thank-you note on the back. An AI-based optimization technique was used in which the model takes two images, an office point cloud rendering and a reference artistic painting, and blends them so that the resulting image looks like the office painted with the brush strokes, color palette, and technique of the artist. We used works by Vincent Van Gogh, Frida Kahlo, Wassily Kandinsky, Katsushika Hokusai, Francis Picabia, William Turner, Edvard Munch, Heinrich Campendonk, and Isaac Abrams as reference images. If you hover over the office images, you will see the original picture!
Collection of wooden cards
Original point cloud rendering after post processing, colored based on the scalar field of laser scanner output.
Heinrich Campendonk - Bucolic Landscape
Frida Kahlo - Self portrait
Katsushika Hokusai - The Great Wave off Kanagawa
Wassily Kandinsky - Composition-VII
Edvard Munch - The Scream (Der Schrei)
Isaac Abrams - Tree of Life
William Turner - The Shipwreck
Francis Picabia - Udnie
Heinrich Campendonk - Yellow Animal
Vincent Van Gogh - Starry Night (it was used as an office poster)
Multi-view stereo (MVS) is a computer vision technique for reconstructing a 3D model of a scene from a set of 2D images taken from different calibrated views. It is an important problem in computer vision and has many applications, including robotics, virtual reality, and cultural heritage preservation.
The goal of MVS is to recover the 3D structure of a scene from a set of images taken from different viewpoints. This is typically done by finding correspondences between points in the different images and using them to triangulate the 3D positions of the points. Once the 3D positions of the points are known, they can be used to generate a 3D model of the scene.
MVS algorithms can be divided into two main categories: dense MVS and sparse MVS.
Finding correspondences between points in various photos can be challenging when using MVS due to the many difficulties it faces, such as occlusion, reflections, and varying lighting conditions. MVS algorithms frequently combine feature matching, optimization, and machine learning techniques to overcome these issues.
Overall, MVS is a powerful tool for understanding and reconstructing 3D scenes from 2D images and has many practical applications in a variety of fields.
Failure cases of block matching. (Image credit: Andreas Geiger)
Traditional multi-view reconstruction approaches use hand-crafted similarity metrics (e.g., NCC) and regularization techniques (e.g., SGM [6]) to recover 3D points. Recent MVS benchmarks [1, 8] report that, although traditional algorithms [2, 3, 10] perform very well on accuracy, the reconstruction completeness still has large room for improvement. The main reason for the low completeness of traditional methods is that hand-crafted similarity measures and block matching work well mainly on Lambertian surfaces and fail in cases such as the occlusions, reflections, and varying lighting conditions mentioned above.
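To make the hand-crafted similarity idea concrete, here is a minimal NumPy sketch of NCC between two equally sized patches (an illustrative toy, not the implementation used by any of the cited methods). Note how a texture-less patch makes the score meaningless, which relates directly to the failure cases above:

```python
import numpy as np

def ncc(patch_a: np.ndarray, patch_b: np.ndarray) -> float:
    """Normalized cross-correlation: 1 = perfect match, -1 = inverted."""
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:  # flat, texture-less patch: matching is ambiguous
        return 0.0
    return float(a @ b / denom)
```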
Zbontar et al. [15] have shown that block matching in feature space can give more robust results and can be used for depth perception in a two-view stereo setting. The goal of multi-view stereo techniques is to estimate a dense representation from overlapping calibrated views. Recent learning-based MVS methods [11, 13, 14] obtain more complete scene representations by learning depth maps in feature space.
Left: traditional 3D reconstruction method [10]. Right: deep learning-based reconstruction.
Here, we will describe the depth estimation in two-view and multi-view settings where the pose information is known. We describe both traditional methods and learning-based methods.
Monocular vision has a scale ambiguity issue, which makes it impossible to triangulate the scene at the correct scale. Put simply, if the distance of the scene from the camera and the geometry of the scene are scaled by some positive factor k, the image plane receives the same projection of the scene independently of the value of k.
\[\begin{equation} \begin{gathered} (X,Y,Z)^T \longmapsto ( f X/Z +o_x , f Y/Z +o_y)^T\\ (kX,kY,kZ)^T \longmapsto ( f kX/kZ +o_x , f kY/kZ +o_y)^T = ( f X/Z +o_x , f Y/Z +o_y)^T \end{gathered} \end{equation}\]

Without any prior information, it is also impossible to perceive scene geometry from a single RGB image. The most popular way of constructing and perceiving scene geometry is to move, so as to obtain a different camera view, as shown below.
The left frame alone does not tell whether there are one or two spheres in the scene; seeing the right frame as well gives the viewer a better understanding and the perception of two spheres with different colors. (Image credit: Arne Nordmann)
Even having multiple monocular views without knowing the extrinsic calibration will not resolve the scale ambiguity: if the relative pose between views and the camera-to-scene distance are again scaled by some positive factor k, the image plane has the same projection of the scene independently of the value of k. The figure below shows that the point has the same projection independent of the scale factor k.
Scale ambiguity of the two-view system without knowing the relative pose.
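To make the ambiguity concrete, here is a tiny NumPy check (with hypothetical example values) that the pinhole projection above is invariant to a global scale $k$:

```python
import numpy as np

def project(p, f=1.0, ox=0.0, oy=0.0):
    """Pinhole projection (f X/Z + o_x, f Y/Z + o_y) of a 3-D point."""
    X, Y, Z = p
    return np.array([f * X / Z + ox, f * Y / Z + oy])

p = np.array([1.0, 2.0, 4.0])
for k in (1.0, 2.0, 10.0):
    print(k, project(k * p))  # the same pixel for every scale k
```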
This section will mainly cover scene triangulation in two- and multi-view settings with known relative pose transformations.
Before diving into the two-view triangulation methods, this subsection introduces multiple-view geometry basics and conventions. Let’s assume $x_1$ and $x_2$ are the projections of a 3D point X in homogeneous coordinates in two different frames; R and T are the rotation and translation from the first frame to the second; $\lambda_1$ and $\lambda_2$ are the distances from the camera centers to the 3D point X.
\[\begin{equation} \begin{gathered} \lambda_1x_1 = X \quad \text{and} \quad \lambda_2x_2 = RX + T\\ \lambda_2x_2 = R(\lambda_1x_1) + T\\ \hat{T}v = T \times v \quad \text{(hat operator, so that } \hat{T}T = T \times T = 0\text{)} \\ \lambda_2\hat{T}x_2 = \hat{T}R(\lambda_1x_1) + \hat{T}T=\hat{T}R(\lambda_1x_1)\\ \lambda_2 x_2^T(\hat{T}x_2) = \lambda_1x_2^T\hat{T}Rx_1\\ x_2 \perp \hat{T}x_2 \implies x_2^T(\hat{T}x_2) = 0\\ x_2^T\hat{T}Rx_1 = 0\quad \text{(epipolar constraint)} \\ E = \hat{T}R\quad \text{(essential matrix)} \end{gathered} \end{equation}\]

With $x'_i$ being the image coordinates of $x_i$ and K being the intrinsic matrix, the relation can be formulated more generically for uncalibrated views.
\[\begin{equation} \begin{gathered} x^{\prime T}_2 K^{-T}\hat{T}R K^{-1}x'_1 = 0 \\ F= K^{-T}\hat{T}RK^{-1}\quad \text{(fundamental matrix)}\\ x^{\prime T}_2Fx'_1 = 0\\ E= K^TFK\quad \text{(relation between the essential and fundamental matrices)} \end{gathered} \end{equation}\]

Because of sensor noise and the discretization step in image formation, the pixel coordinates of the 3D point projections are noisy, so the rays through corresponding points in the image planes usually do not intersect in 3D. This noise should be taken into account to obtain an accurate triangulation of corresponding points. There are multiple ways of doing two-view triangulation; two of them are covered here.
Midpoint method for two-view triangulation.
The midpoint method is a simple approach to two-view triangulation. As shown in the figure above, the idea is to find the closest distance between the bearing vectors, i.e. the rays from the camera centers through the corresponding points on the image planes. $Q_1$ and $Q_2$ are the points where these rays are closest to each other. For $Q_1Q_2$ to be the closest distance between the rays, the line through $Q_1Q_2$ must be perpendicular to both bearing vectors. The midpoint of $Q_1Q_2$ is accepted as a valid 3D triangulation of the corresponding points. With $\lambda_i$ the scalar distance from the camera center to the 3D point $Q_i$, and R and T the relative pose from the second camera frame to the first, the approach can be formulated as below:
\[\begin{equation} \begin{gathered} Q_1 = \lambda_1 d_1 \quad Q_2 = \lambda_2 Rd_2 + T \quad \text{($C_1$ is chosen as the origin for simplicity)}\\ (Q_1-Q_2)^T d_1 = 0 \quad (Q_1-Q_2)^T Rd_2 = 0 \quad \text{(dot products with the perpendicular line)}\\ \lambda_1 d_1^Td_1 - \lambda_2 (Rd_2)^Td_1 = T^Td_1\\ \lambda_1 d_1^T(Rd_2) - \lambda_2 (Rd_2)^T(Rd_2) = T^T(Rd_2)\\ \begin{pmatrix} d_1^Td_1 & -(R d_2)^Td_1 \\ d_1^T(Rd_2) & -(R d_2)^T(Rd_2) \end{pmatrix} \begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix} = \begin{pmatrix} T^Td_1\\ T^T(Rd_2) \end{pmatrix} \quad \text{($Ax = b$ form)}\\ \begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix} = A^{-1}b\\ P= (Q_1+Q_2)/2 = (\lambda_1d_1 + \lambda_2Rd_2+T)/2 \end{gathered} \end{equation}\]

The linear method is based on the fact that, in the ideal case, the back-projected ray and the ray from the camera center through the correspondence on the image plane should be aligned, so the cross product of these two vectors should be zero. Using this, the problem is converted into a set of linear equations that can be solved by SVD. Let x and y be corresponding points, $P$ and $Q$ their respective perspective projection matrices, and $\lambda_x$ and $\lambda_y$ scalar values.
\(\begin{equation} \begin{gathered} x = \begin{pmatrix} u_x \\ v_x \\ 1 \end{pmatrix} \quad y = \begin{pmatrix} u_y \\ v_y \\ 1 \end{pmatrix}\\ \lambda_x x = PX \quad \lambda_y y = QX\\ x \times PX = 0 \quad y \times QX = 0\\ \begin{pmatrix} u_x \\ v_x \\ 1 \end{pmatrix} \times \begin{pmatrix} p_1^T \\ p_2^T \\ p_3^T \end{pmatrix}X = 0 \quad \begin{pmatrix} u_y \\ v_y \\ 1 \end{pmatrix} \times \begin{pmatrix} q_1^T \\ q_2^T \\ q_3^T \end{pmatrix}X = 0\\ \begin{pmatrix} v_x p_3^T-p_2^T \\ p_1^T- u_x p_3^T \\ u_x p_2^T - v_x p_1^T \end{pmatrix}X = 0\quad \begin{pmatrix} v_y q_3^T-q_2^T \\ q_1^T- u_y q_3^T \\ u_y q_2^T - v_y q_1^T \end{pmatrix}X = 0\\ \begin{pmatrix} v_x p_3^T-p_2^T \\ p_1^T- u_x p_3^T \\ v_y q_3^T-q_2^T \\ q_1^T- u_y q_3^T \end{pmatrix}X = 0\\ AX = 0 \end{gathered} \end{equation}\) Solutions for X can be easily calculated using the Singular Value Decomposition.
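The following NumPy sketch (a self-contained toy with identity intrinsics, not code from any of the cited works) builds a synthetic two-view setup, verifies the epipolar constraint $x_2^T E x_1 = 0$, and recovers the point with the linear method above:

```python
import numpy as np

def hat(v):
    """Skew-symmetric matrix such that hat(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def triangulate_dlt(x1, x2, P, Q):
    """Linear two-view triangulation: solve the AX = 0 system via SVD."""
    A = np.stack([x1[1] * P[2] - P[1],
                  P[0] - x1[0] * P[2],
                  x2[1] * Q[2] - Q[1],
                  Q[0] - x2[0] * Q[2]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize

# Synthetic setup: camera 1 at the origin, camera 2 rotated and translated.
theta = np.deg2rad(5.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
T = np.array([-0.3, 0.0, 0.05])

X_true = np.array([0.5, -0.2, 4.0])
x1 = X_true / X_true[2]        # normalized projection in view 1
Xc2 = R @ X_true + T
x2 = Xc2 / Xc2[2]              # normalized projection in view 2

E = hat(T) @ R                 # essential matrix
print(x2 @ E @ x1)             # epipolar constraint, ~0

P = np.hstack([np.eye(3), np.zeros((3, 1))])
Q = np.hstack([R, T[:, None]])
print(triangulate_dlt(x1, x2, P, Q))  # ~ X_true
```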
Multi-view triangulation method [9].
For multi-view triangulation, one solution is to find the X in 3D space that minimizes the sum of squared distances to the bearing vectors. An analytical solution for X can be found by taking the derivative of the loss function with respect to the 3D point and finding the point where this derivative equals zero. With $C_i$ the camera center of the $i^{th}$ camera, $P_i$ the point on the $i^{th}$ bearing vector closest to X, $\lambda_i$ the scalar distance between $C_i$ and $P_i$, and X the optimal 3D point, the triangulation result can be formulated as below.
\[\begin{equation} \begin{gathered} P_i = C_i + \lambda_id_i \\ \lambda_id_i \sim X - C_i \quad \text{(ideal case with no noise)} \\ \lambda_i = \lambda_id_i^T d_i \sim d_i^T(X-C_i)\\ P_i = C_i + \lambda_id_i \sim C_i + d_i d_i^T(X-C_i) \\ r = X - C_i - d_i d_i^T(X-C_i) = (I - d_id_i^T)(X-C_i) \\ \mathcal{L} = \sum_{i=1}^N r^2 = \sum_{i=1}^N ((I - d_id_i^T)(X-C_i))^2 \\ \arg\min_{X} \mathcal{L} \Rightarrow \frac{\partial\mathcal{L}}{\partial X} = 0 \\ \frac{\partial\mathcal{L}}{\partial X} = 2 \sum_{i=1}^N (I - d_id_i^T)^2(X-C_i) = 0\\ A_i = (I - d_id_i^T)\Rightarrow \frac{\partial\mathcal{L}}{\partial X} = \sum_{i=1}^N A_i^TA_i(X-C_i) = 0\\ X = \left(\sum_{i=1}^N A_i^TA_i\right)^{-1}\sum_{i=1}^N A_i^TA_iC_i \end{gathered} \end{equation}\]

Siamese networks for stereo matching [15].
Zbontar et al. [15] initially showed that depth information can be extracted from rectified image pairs by learning a similarity measure on image patches. They train their CNN-based siamese network as a binary classifier on matching and non-matching pairs of patches.
GC-Net: deep stereo regression architecture [7].
Kendall et al. [7] proposed a network that uses a 2D CNN with shared weights to extract features from the rectified image pair. These feature maps are then used to compute a matching-score-based cost volume, and as a last step a 3D CNN-based autoencoder regularizes this volume.
Patch-Based Multi-View Stereo [2] has proven to be quite effective in practice. After an initial feature matching step aimed at constructing a sparse set of photoconsistent patches, in the sense of the previous section—that is, patches whose projections in the images where they are visible have similar brightness or color patterns—it divides the input images into small square cells a few pixels across and attempts to reconstruct a patch in each one of them, using the cell connectivity to propose new patches and visibility constraints to filter out incorrect ones. We assume throughout that n cameras with known intrinsic and extrinsic parameters observe a static scene, and we respectively denote by $O_i$ and $I_i (i = 1, \ldots, n)$ the optical centers of these cameras and the images they have recorded of the scene. The main elements of the PMVS model of multi-view stereo fusion and scene reconstruction are small rectangular patches, intended to be tangent to the observed surfaces, together with a few key properties of these patches—namely, their geometry, which images they are visible in, whether they are photoconsistent with those images, and some notion of connectivity inherited from image topology.
State-of-the-art learning-based MVS approaches adapt the photogrammetry-based MVS algorithms by implementing them as a set of differentiable operations defined in the feature space. MVSNet [13] introduced good quality 3D reconstruction by regularizing the cost volume that was computed using differentiable homography on feature maps of the reference and source images.
\[\begin{equation} \begin{gathered} p_a =\frac{1}{z_a}K_aH_{ab}z_bK_b^{-1}p_b=\frac{z_b}{z_a}K_aH_{ab}K_b^{-1}p_b\\ \text{($R$ and $t$ are the relative pose of view $a$ with respect to view $b$)}\\ H_{ab}P_b = RP_b+t\\ \text{Plane constraint: } n^TP_b+d = 0 \implies -n^TP_b/d = 1\\ H_{ab}P_b = RP_b+t\frac{-n^TP_b}{d} = \left(R-\frac{tn^T}{d}\right)P_b\\ H_{ab} = R-\frac{tn^T}{d} \end{gathered} \end{equation}\]
Two-view homography [4, 5, 12].
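A quick NumPy check (hypothetical plane and pose values, identity intrinsics; with intrinsics the warp becomes $K_a H_{ab} K_b^{-1}$) that the plane-induced homography transfers pixels correctly:

```python
import numpy as np

R = np.eye(3)                     # simple relative rotation
t = np.array([0.2, 0.0, 0.0])     # baseline along x
n = np.array([0.0, 0.0, -1.0])    # plane normal, n^T P_b + d = 0
d = 5.0                           # => the plane sits at depth z = 5 in view b

H = R - np.outer(t, n) / d        # plane-induced homography

P_b = np.array([1.0, -0.5, 5.0])  # a 3-D point on the plane
p_b = P_b / P_b[2]                # its normalized pixel in view b
P_a = R @ P_b + t                 # the same point in view a coordinates
p_a = P_a / P_a[2]                # its normalized pixel in view a

p_h = H @ p_b                     # homography-transferred pixel
print(p_a, p_h / p_h[2])          # identical up to scale
```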
MVSNet architecture [13]
MVSNet first extracts deep features of the N input images (N being the number of views) for dense matching, applying convolutional filters to build multi-scale feature towers. Each convolutional layer is followed by a batch-normalization (BN) layer and a rectified linear unit (ReLU), except for the last layer. Using these features and the camera parameters, the network then builds cost volumes via differentiable homography.
The raw cost volume computed from the image features is regularized afterwards: a multi-scale 3D CNN refines the cost volume into a probability volume for depth inference. The depth regressed from the probability volume is further refined with a 2D CNN.
Traditional MVS methods have good accuracy but struggle with completeness, while recently developed learning-based multi-view stereo (MVS) techniques have improved completeness at the expense of accuracy. We propose depth discontinuity learning for MVS methods, which further improves accuracy while retaining the completeness of the reconstruction. Our idea is to jointly estimate the depth and boundary maps, where the boundary maps are explicitly used for further refinement of the depth maps. We validate our idea and demonstrate that our strategies can be easily integrated into existing learning-based MVS pipelines where the reconstruction depends on high-quality depth map estimation. Extensive experiments on various datasets show that our method improves reconstruction quality compared to the baseline. Experiments also demonstrate that the presented model and strategies have good generalization capabilities.
We propose to estimate depth as a bimodal univariate distribution. Using this depth representation, we improve multi-view depth reconstruction, especially across geometric boundaries.
An overview of the proposed multi-view depth discontinuity learning network that outputs depth and edge information for each pixel. The brown arrows represent the input feed and the blue arrows the pipeline flow. We first extract multi-scale features from the color images with an FPN-like [35] auto-encoder. We then feed the extracted features and camera parameters to the coarse-to-fine PatchMatch stereo module to extract the initial depth map. Using the initial depth map and the RGB pair, our network learns bimodal depth parameters and geometric edge maps. We use the mixture parameters and photo-geometric filtering to compute the final depth map. The edge map visualized here is the negated edge map (for a clearer view).
Comparison between our method and the baseline method PatchmatchNet on a set of scenes from the Tanks and Temples dataset. For each scene, the top row shows the results from PatchmatchNet, and the bottom row shows the results from our method. A zoomed view of the marked image region is shown on the right of each result.
Nail Ibrahimli, Shenglan Du
1 Background
We live in a three-dimensional (3D) world composed of urban entities such as buildings and streets. In recent years, there has been an ever-increasing demand, both from academia and industry, for 3D spatial information and 3D city models [Biljecki et al., 2015]. Modelling urban scenes from 3D data has become a fundamental task for various applications such as architecture design, infrastructure planning, land administration, and city management.
One common way to obtain 3D models of buildings is to reconstruct surface meshes from point clouds (i.e., a discrete data representation of the surrounding space acquired from LiDAR systems). There exists a large body of work on point cloud reconstruction [Kazhdan et al., 2006; Nan and Wonka, 2017]. Although point clouds can preserve the raw geometric information of urban objects well, the reconstructed models still suffer from issues such as noise and undesired structures. The goal of this assignment is to optimize the geometry of such noisy models, finding the minimum change of the vertex coordinates that makes the model geometry regular.
2 Dataset Explanation
Given a noisy reconstructed polyhedron mesh model with N vertices, we want to find the minimum change of the coordinates to make the model geometry regular. Figure 1 gives a visualization of the 4 models which we use to test our algorithm.
Figure 1. Visualization of the input models.
3 Problem Statement
We directly take the points P of the polyhedron model as input to our optimization algorithm, which could be formulated as
\[P=(x_1, y_1, z_1, x_2, y_2, z_2, x_3, y_3, z_3, ... , x_n, y_n, z_n)\]

where each $(x,y,z)$ represents the 3D coordinates of a point of the model and n represents the total number of points of the model. Clearly, we have $P \in R^{3n}$.
As the model is noisy, its geometry is usually not perfectly regular. For instance, edge/face pairs that are supposed to be parallel are not perfectly parallel, and intersecting edges/faces that are supposed to be orthogonal are not perfectly orthogonal (Figure 1 gives a visualization of the noisy models with irregular geometry).
In this project, we aim to find the minimum change of the coordinates $\Delta P$ that makes the model geometry regular. We focus on one specific geometric constraint: orthogonality between near-orthogonal edge pairs.
3.1 Data preprocessing
We start by searching for near-orthogonal pairs of edges of the model. These edge pairs will form the orthogonal constraints later. The main idea is to loop over all possible combinations of edges in the model, calculate their intersecting angles, and if the angle is near 90 degrees, then store the corresponding edge pair in a set S.
Figure 2. Near-orthogonal edge pairs of a cube model.
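A minimal Python sketch of this preprocessing step (the actual implementation is in C++ with Easy3D; the function name and tolerance below are illustrative):

```python
import numpy as np
from itertools import combinations

def near_orthogonal_pairs(vertices, edges, tol_deg=5.0):
    """Collect index pairs of edges whose angle is within tol_deg of 90 degrees.

    vertices: (n, 3) array of coordinates; edges: list of (s, t) vertex indices.
    """
    S = set()
    for (i, ei), (j, ej) in combinations(enumerate(edges), 2):
        di = vertices[ei[1]] - vertices[ei[0]]
        dj = vertices[ej[1]] - vertices[ej[0]]
        cos = abs(di @ dj) / (np.linalg.norm(di) * np.linalg.norm(dj))
        angle = np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))
        if abs(angle - 90.0) < tol_deg:
            S.add((i, j))
    return S
```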
3.2 Problem Formulation
Our goal is to find a new vector $X \in R^{3n}$ that produces perfect orthogonal relationships for all the edge pairs in the set S (obtained in the preprocessing step) and, at the same time, generates the least cost of modifying the model geometry, $||X-P||_2$.
Assume we have an edge $e_i$ with a starting point $p_s$ and a tail point $p_t$, the geometry of $e_i$ can be represented as:
\[(x_t-x_s,y_t-y_s,z_t-z_s)\]

In a matrix formulation, the geometry is:
\[X^TE_i\]

where $E_i$ is a $3n \times 3$ matrix. The elements $E_{(3s,0)},E_{(3s+1,1)}$ and $E_{(3s+2,2)}$ are -1, while the elements $E_{(3t,0)},E_{(3t+1,1)}$ and $E_{(3t+2,2)}$ are +1. All the remaining elements are 0.
Similarly, for another edge $e_j$, we can represent its geometry as: \(X^TE_j\) Given an edge pair {$e_i,e_j$} in the set S, we exploit their orthogonal relationship by calculating the dot product of the two edge vectors:
\[d = (X^TE_i) \cdot (X^TE_j) = X^TE_i(X^TE_j)^T= X^TE_iE_j^TX\]
Figure 3. Visualization of an edge pair
Therefore, the problem is formulated as
\[\text{min } ||X-P||_2 \\ s.t.\ X^TE_iE_j^TX = 0, \text{ for } (i,j) \in S\]

where $X\in R^{3n}$ is the variable we want to optimize over, $P\in R^{3n}$ is the vector of the raw point 3D coordinates, and $S$ is the set of near-orthogonal edge pairs obtained in the Section 3.1 data preprocessing step.
3.3 Problem Approximation
In Section 3.2 we gave the exact formulation of the problem, which has a convex objective function but quadratic equality constraints; because of those constraints, it is not a convex problem.
To solve the problem with a gradient-based method, we approximate this non-convex problem by an unconstrained one. Inspired by the weight regularization strategy commonly used in deep neural networks [Goodfellow et al., 2016], we turn the quadratic constraints into a regularization term of the existing objective function.
The approximate optimization is formulated as
\[\text{min } ||X-P||_2 + \lambda \sum_{i,j\in S} |X^TE_iE_j^TX|\]

where $\lambda$ is a constant term that determines how much influence the orthogonality regularizer contributes to the objective function, and $|\cdot|$ gives the absolute value of the term inside.
In this way, we eliminate the quadratic constraints and approximate the problem with an unconstrained one. We can solve this problem using the gradient descent method; more specifically, we resort to a sub-gradient method, as the absolute value function $|\cdot|$ has no gradient where its argument is 0. Section 4.2 gives a detailed derivation of the gradient method and our experimental settings for $\lambda$ and the learning rate $\eta$.
4 Implementation Details
We implemented both optimizers in modern C++ (C++11 or later). We used Easy3D [Nan, 2021] for linear algebra operations and for rendering. The source code and the datasets are available at:
https://github.com/Mirmix/3D_shape_regularization
4.2 Solving with the Sub-Gradient Method
We used the Eigen library for linear algebra operations. Following Section 3.3, we derive our loss function as follows; our goal is to minimize it using the sub-gradient descent method.
\[\mathcal{L} = ||X-P||_2 + \lambda \sum_{i,j\in S} |X^TE_iE_j^TX|\]

The total loss is composed of two parts. The first term penalizes the L2-norm deviation of $X$ from the input $P$. The second, L1-norm term functions as an orthogonality regularizer, penalizing the residual of the inner product of each near-orthogonal edge pair.
The gradient of the first term w.r.t. $X$ is $2(X-P)$; the second term is trickier, because it depends on the sign of each quadratic term $X^TE_iE_j^TX$ for the edge pairs {$i,j$} in the set S. Accordingly, we give a sketch of the sub-gradient descent method:
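A NumPy sketch of the procedure (the actual implementation is in C++ with Eigen; the update follows the rule described next):

```python
import numpy as np

def subgradient_descent(P, edge_pairs, lam=1.0, eta=1e-3, steps=500):
    """Sub-gradient descent for the regularized objective of Section 3.3.

    P: (3n,) stacked vertex coordinates; edge_pairs: list of (E_i, E_j),
    each a (3n, 3) selection matrix as defined in Section 3.2.
    """
    X = P.copy()
    for _ in range(steps):
        g = 2.0 * (X - P)  # gradient of the data term, as derived above
        for E_i, E_j in edge_pairs:
            a = E_i.T @ X  # edge vector e_i
            b = E_j.T @ X  # edge vector e_j
            # Sub-gradient of lam * |X^T E_i E_j^T X|: the sign of the dot
            # product times the gradient of the quadratic form.
            g += lam * np.sign(a @ b) * (E_i @ b + E_j @ a)
        X = X - eta * g    # X_{k+1} = X_k - eta * g(X_k)
    return X
```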
While updating the gradient $g(X_k)$, we choose plus or minus based on the sign of each individual dot product term $X_k^TE_iE_j^TX_k$. The update $X_{k+1} = X_k-\eta g(X_k)$ then iteratively approaches the optimal solution via the gradient scaled by the learning rate. We have two hyperparameters to tune: $\lambda$ and $\eta$.
We have experimented with our sub-gradient descent method over the 4 models. Figure 4 gives the visualization of the losses during the optimization process. Different colours indicate the loss convergence for different models. The legend on the top right describes the model names. The visualization shows that our algorithm is effective for all the input models.
Figure 4. Loss convergence of the 4 models over 500 steps
References:
[1] Biljecki, F., Stoter, J., Ledoux, H., Zlatanova, S., and Coltekin, A. (2015). Applications of 3D city models: State of the art review. ISPRS International Journal of Geo-Information, 4(4):2842-2889.
[2] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
[3] Kazhdan, M., Bolitho, M., and Hoppe, H. (2006). Poisson surface reconstruction. In Proceedings of the Fourth Eurographics Symposium on Geometry Processing.
[4] Nan, L., and Wonka, P. (2017). PolyFit: Polygonal surface reconstruction from point clouds. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2353-2361).
[5] Nan, L. (2021). Easy3D: A lightweight, easy-to-use, and efficient C++ library for processing and rendering 3D data.