In this post I'll tackle Azure Service Fabric Cluster deployment issues. I'll go through some of the problems I've seen and solutions that worked for me. Hope this helps people out there.

Azure Service Fabric Cluster stuck on the Status: "Waiting for nodes" after deployment - Certificate Thumbprint Issue

The issue I'm about to share may seem trivial, but my colleagues and I ended up spending quite a bit of time on it. We weren't very familiar with Azure Service Fabric clusters, and our goal was to deploy a secure Azure Service Fabric cluster into our project's enterprise Azure subscription.
After locking in all the configuration and triggering a deployment, all looked fine in the Azure cloud and our cluster deployed successfully. Happy faces all around, until we realized the cluster never reached a status of "Ready", so we couldn't begin to work with it.
The main reason I'm writing about this is that the cluster deploys successfully and you get no error messages, yet you still can't use it because it isn't fully working. If you haven't delved into Service Fabric clusters before, I'd assume it's hard to figure out why. The only noticeable symptom is that the status is listed as "Waiting for nodes". If you see this for longer than you'd expect, something is wrong and time won't fix it, so don't wait for some magic behind the scenes; there's no help coming. Basically, the nodes fail to join the cluster, but Azure so far doesn't surface any errors in the portal logs. The cluster view looked like the below:
I tried various deployment approaches, thinking the Azure ARM template deployment I was using was to blame, but the outcome was the same no matter how I deployed the cluster, manually or otherwise.
I did notice that provisioning a non-secure cluster worked just fine, so clearly the crux of the matter was the security aspect, most likely the certificate used for the deployment.
To cut a long story short, the problem was a very simple little issue I've run into many times in my career: a copy/paste flaw :)
The certificate thumbprint wasn't getting copied correctly, neither from the Certificates MMC snap-in nor from PowerShell ISE when I queried for it using the cmdlet below.
Get-ChildItem -Path Cert:\CurrentUser\My | Where-Object { $_.Subject -match "CN=testsfcluster.westeurope.cloudapp.azure.com" }
The pasted value looked exactly the same, but an invisible character gets copied along with it. Unlike other times, there's no stray space to give it away: it looks the same, but it isn't the same.
As a result the nodes couldn't join the cluster because they were using an incorrect thumbprint to identify the certificate.

Fix:

Type in the thumbprint manually. Alternatively, the only way I could get copy/paste to pick up the right thumbprint, without any hidden hitchhikers, was to dump the certificate details to a text file via PowerShell and copy from there, as shown below:
Get-ChildItem -Path Cert:\CurrentUser\My | Where-Object { $_.Subject -match "CN=testsfcluster.westeurope.cloudapp.azure.com" } > res.txt
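If you'd rather not type 40 characters by hand, another option is to strip everything that isn't a hex digit from whatever you pasted. What follows is only a rough sketch: the function name Get-CleanThumbprint is made up for illustration, the thumbprint is a dummy value, and I'm assuming the stowaway is something like a left-to-right mark (U+200E), which may not be the exact character you run into.

```powershell
# Hypothetical helper (not a built-in cmdlet): keeps only hex digits,
# dropping spaces and invisible characters, and upper-cases the result.
function Get-CleanThumbprint {
    param(
        [Parameter(Mandatory = $true)]
        [string]$Thumbprint
    )

    ($Thumbprint -replace '[^0-9A-Fa-f]', '').ToUpperInvariant()
}

# Dummy thumbprint with an assumed invisible U+200E character in front.
$pasted = "$([char]0x200E)0123456789abcdef0123456789abcdef01234567"
$clean  = Get-CleanThumbprint -Thumbprint $pasted

$pasted.Length   # 41, although only 40 characters are visible
$clean.Length    # 40
```

Comparing the two lengths is a quick way to tell whether something invisible came along for the ride.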

GUI confusion

If you opt to create the Service Fabric cluster manually, you will come across the screen where you have to secure the cluster with certificate details. The field descriptions might send you down the wrong path, as they did me the first time I tried this. As you can see below, you need to fill in the Key Vault details plus the certificate URL and thumbprint. You can get the URL from the Key Vault after you upload the certificate. But the thumbprint field description says: "This refers to the thumbprint of the certificate, which can be found in the Certificate URL specified earlier"
If that convinces you to copy the last part of the URL and paste it into the thumbprint field, please stop, as that is not the way to go. I tried that first, and unfortunately the validation goes through just fine, as seen below:
A certificate thumbprint has 40 characters, but the above has only 32. Validation still passes and the portal goes ahead and creates the cluster for you, but then you end up in the "waiting for nodes" situation without any hint as to what the problem is, unless you remotely connect to the virtual machine scale set created for the cluster and check the event log.
You need to put in the thumbprint that you get from the certificate itself, using the Certificates MMC snap-in or a PowerShell cmdlet like the one I included above. This of course assumes you have the certificate installed locally on your machine; if it is a public certificate, make sure you have the correct thumbprint when you trigger the cluster creation.
If you suspect thumbprint copy/paste problems, it's best to type the thumbprint by hand to rule this issue out, as you can get caught out easily.
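Since the portal validation won't catch a wrong-sized value, a quick local check before you hit create can save a failed deployment. Here is a minimal sketch, assuming a 40-character hex thumbprint as described above; Test-ThumbprintFormat is a name I made up, and both sample values are dummies rather than real thumbprints.

```powershell
# Hypothetical helper (not a built-in cmdlet): returns $true only when the
# value is exactly 40 hex digits with nothing else before or after them.
function Test-ThumbprintFormat {
    param(
        [Parameter(Mandatory = $true)]
        [AllowEmptyString()]
        [string]$Value
    )

    $Value -match '^[0-9A-Fa-f]{40}\z'
}

# Dummy 40-character value: passes.
Test-ThumbprintFormat -Value '0123456789abcdef0123456789abcdef01234567'

# Dummy 32-character value, like the last segment of a certificate URL: fails.
Test-ThumbprintFormat -Value '0123456789abcdef0123456789abcdef'
```

The check only looks at the shape of the value. It can't tell you whether the thumbprint belongs to the certificate you uploaded, so it complements reading the thumbprint from the certificate rather than replacing it.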

Azure Service Fabric Cluster stuck on the Status: "Waiting for nodes" after deployment - Key Vault Issue

While trying to reproduce my initial error, I came across another one. Again the cluster deployment runs successfully and you get a successful deployment message back from the Azure fabric, but the cluster is once more stuck in the "waiting for nodes" status, as seen below:
This time around the issue was that the virtual machine scale set that is provisioned behind the scenes for the nodes of the cluster actually failed to retrieve the certificates from Key Vault and that left the nodes in a stuck state, never joining the cluster.
Once I tried to browse to the scale set I was able to see the below errors, unlike the previous "waiting for nodes" issue where there was no sign in the portal of any problems:
The error reads: "Key Vault either has not been enabled for deployment or the vault id provided doesn't match the key vault's true resource id".
Armed with this piece of information, I was now able to fix the issue. I browsed to the Key Vault and looked at the settings under advanced access policies.
I ticked the top two options there and saved the configuration.
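The same switches can also be flipped from PowerShell instead of the portal. Treat this as a sketch under assumptions: it relies on the Az.KeyVault module and an already authenticated session, 'my-sf-keyvault' is a placeholder vault name, and I'm assuming the two portal checkboxes map to the deployment and template deployment flags (the error message above only confirms the first).

```powershell
# Assumes the Az.KeyVault module is installed and Connect-AzAccount has been run.
# 'my-sf-keyvault' is a placeholder; substitute your own vault name.
Set-AzKeyVaultAccessPolicy -VaultName 'my-sf-keyvault' `
    -EnabledForDeployment `
    -EnabledForTemplateDeployment

# Read the vault back to confirm both flags now show True.
Get-AzKeyVault -VaultName 'my-sf-keyvault' |
    Select-Object VaultName, EnabledForDeployment, EnabledForTemplateDeployment
```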
A re-deployment was now needed in order to move on, after first cleaning up all the resources that were created for the previous, failed cluster.
The simplest approach I find is to create the Service Fabric cluster in its own dedicated resource group, which makes it easy to delete the whole resource group and start again for a clean retry. The alternative is to individually delete every resource the deployment created, and there are quite a few of them.
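With a dedicated resource group, the clean-up boils down to a single command. A minimal sketch, assuming the Az.Resources module and an authenticated session; 'sf-cluster-rg' is a placeholder name, and the command deletes everything in that group, so double-check the name before running it.

```powershell
# Assumes the Az.Resources module is installed and Connect-AzAccount has been run.
# 'sf-cluster-rg' is a placeholder for the dedicated resource group.
# -Force skips the confirmation prompt; leave it off if you want to be asked first.
Remove-AzResourceGroup -Name 'sf-cluster-rg' -Force
```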
Immediately after you deploy a Service Fabric cluster, it may once again show the below status in the Azure Portal: "waiting for nodes"
But if you've been following my instructions, it will then move into the "baseline upgrade" status as seen below:
Then it will show a status of "Auto-scaling"
And then it will finally be available to use, showing a status of "Ready":

Hope all this helps people starting out with Azure Service Fabric clusters. I'll keep sharing future experiences with Azure and more. Stay tuned.



Relevant links:
https://azure.microsoft.com/en-gb/documentation/articles/service-fabric-cluster-creation-via-arm/
https://azure.microsoft.com/en-gb/documentation/articles/service-fabric-cluster-creation-via-portal/
https://github.com/Azure/azure-content/tree/master/articles/service-fabric