Information for the SCSD program

Christopher J. Kingsbury; cjkingsbury@gmail.com

Welcome to SCSD - A generic symmetry analysis tool written by Dr. Chris Kingsbury at Trinity College Dublin. This tool should be easy to use, but may be a little difficult to understand, so please use the tutorial and read through this article to familiarise yourself with it before publishing.

Essentially, this tool is a logical extension of Shelnutt's "NSD" routine applied using something like Largent's "Symmetrise" routine - so that the 'modes' which are fit can be arbitrary for generic molecules. These modes, unlike in a usual principal component analysis, are divided along the irreducible representations of the parent symmetry group, which allows, among other things, one to determine and restrict the symmetry operations to certain groups, to observe minor deviations and to directly correlate structure and electronic properties (such as splitting) with static calculated orbitals. Effectively we're separating the signal from the noise - minor movement from actual dissymmetry - by using encoded symmetric information within the molecules. Also, we can look at effects on symmetric orbitals by only using the modes that'll affect them - probably easier to correlate. Think of it like a super-powered 'deviation from plane' calculation - a method of quantitatively assigning a particular shape to a complex molecule.

The pdb-format input files generally need to be stripped of extraneous atoms. There may be a way to include this step in a GUI at some point in the future, but at the moment the input must either contain only the specific atoms of interest or be accompanied by a model which can 'trim' the atom set to a set framework. Otherwise the results will be meaningless.

The procedure works in two different ways, depending on whether a model (a totally symmetric and minimally distorted representation of the same molecule, pre-aligned to the symmetry operations) is provided.

With model (spatial):

The provided pdb atoms are fitted to the model by a combined translation and rotation - a mixed cartesian and quaternion optimisation, with the atom pairing taken from a cost matrix by the scipy linear_sum_assignment method. The operation also trims any atom types which are not found in the model, and any unfitted atoms, but it is imperfect and may fail to find the right fragment. This should be apparent in the interactive Figure 2, where the 'A1g' won't look right. Make sure that the correct atoms have been identified before quoting numbers, and try again with 'basinhopping' if it won't find the right fragment.
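As a rough illustration of this kind of fit (a minimal sketch, not the SCSD code itself), the cost of a trial translation and quaternion can be taken as the summed distance between each model atom and its assigned query atom. The function name fit_cost, the parameter layout and the toy coordinates are all assumptions made for the example; only linear_sum_assignment, cdist, Rotation and minimize are real scipy calls.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment, minimize
from scipy.spatial.distance import cdist
from scipy.spatial.transform import Rotation


def fit_cost(params, query, model):
    """Sum of model-to-query distances after moving the query.

    params holds a translation (3 values) followed by a quaternion
    (4 values, scalar last); this layout is an assumption of the sketch.
    """
    shift, quat = params[:3], params[3:]
    moved = Rotation.from_quat(quat).apply(query) + shift
    cost = cdist(model, moved)                # rows: model atoms, columns: query atoms
    rows, cols = linear_sum_assignment(cost)  # best one-to-one pairing
    return cost[rows, cols].sum()             # surplus query atoms stay unpaired


# Toy model (hypothetical coordinates, in angstrom).
model = np.array([[1.2, 0.0, 0.0], [0.0, 0.7, 0.0],
                  [-1.2, 0.0, 0.0], [0.0, -0.7, 0.0]])

# Query: the model rotated and shifted, plus one extraneous atom.
true_rot = Rotation.from_euler("z", 30, degrees=True)
true_shift = np.array([0.5, -0.2, 0.1])
query = np.vstack([true_rot.apply(model) + true_shift, [[5.0, 5.0, 5.0]]])

# Parameters that undo the known transformation exactly.
undo = true_rot.inv()
exact = np.concatenate([-undo.apply(true_shift), undo.as_quat()])
exact_cost = fit_cost(exact, query, model)

# A local search starting from "no movement"; it may stall in a local minimum.
start = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0])
start_cost = fit_cost(start, query, model)
result = minimize(fit_cost, start, args=(query, model), method="Nelder-Mead")
print(start_cost, result.fun, exact_cost)
```

Because the cost matrix is rectangular, the extra query atom is simply left unassigned, which is one way the trimming can fall out of the assignment step. A local search like the one above can get stuck, which is where a global strategy such as scipy.optimize.basinhopping wrapped around the same cost function could help.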

Without model:

A temporary model is generated: the query atoms are fitted directly to the symmetry operations, which are applied to all atoms, and the resulting deviation is minimised through a least-squares refinement. Then, the process continues as above. This should retrieve the same values (except A1/A1g) as for the "with model" version, except in cases with wrong assignment or interchangeable axes. In this scenario, however, there is no trimming of the query atom list, and therefore the .pdb must be edited to exclude all atoms not under analysis, i.e. those not in the aromatic unit.
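One way to picture the quantity being minimised here is as a least-squares residual between the atoms and their images under each symmetry operation. The sketch below is an illustration under that assumption, using C2v and a hypothetical water-like fragment; symmetry_residual is a made-up name, and the real refinement may well be formulated differently.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from scipy.spatial.transform import Rotation

# The four operations of C2v as matrices: E, C2(z), sigma(xz), sigma(yz).
C2V_OPS = [np.diag(d) for d in ([1, 1, 1], [-1, -1, 1], [1, -1, 1], [-1, 1, 1])]


def symmetry_residual(coords, ops=C2V_OPS):
    """Sum of squared distances between the atoms and their matched images."""
    total = 0.0
    for op in ops:
        image = coords @ op.T                     # apply the operation to every atom
        cost = cdist(coords, image, "sqeuclidean")
        rows, cols = linear_sum_assignment(cost)  # pair each atom with its image
        total += cost[rows, cols].sum()
    return total


# Hypothetical water-like fragment, already aligned to the C2v operations.
aligned = np.array([[0.0, 0.0, 0.1], [0.75, 0.0, -0.5], [-0.75, 0.0, -0.5]])

# The same fragment tilted by 20 degrees about x, so it no longer lines up.
tilted = Rotation.from_euler("x", 20, degrees=True).apply(aligned)

print(symmetry_residual(aligned), symmetry_residual(tilted))
```

The aligned fragment gives a residual of zero and the tilted one does not, so minimising this number over rotations and translations would pull a structure towards the orientation in which the operations apply.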

The with-model approach can similarly be used in unattended mode - this allows processing of thousands of structures for statistical analysis. This is non-trivial to set up - please contact me if this is something you'd like to do.

Once aligned, the SCSD routine is engaged - this takes the molecule and effectively 'symmetrises' it to the symmetry operations within each of the irreducible representations, generating a set of vectors which relate this partially symmetric version to the next-highest symmetric representation (i.e. the model, A1g, A1, Ag). This vector distance sum is the 'SCSD' value, equivalent to an NSD value - and the sum of all of these vectors gives the original atom positions.
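The 'symmetrising' step can be sketched with the standard group-theory projection operator: apply each operation, weight the result by the character of the irreducible representation, and average. The example below does this for C2v with a hypothetical three-atom model. The names decompose and scsd are illustrative, and taking the value as a sum of per-atom displacement lengths is only one reading of "vector distance sum"; none of this is lifted from the SCSD source.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# C2v operations (E, C2(z), sigma(xz), sigma(yz)) and the C2v character table.
OPS = [np.diag(d) for d in ([1, 1, 1], [-1, -1, 1], [1, -1, 1], [-1, 1, 1])]
CHARACTERS = {"A1": [1, 1, 1, 1], "A2": [1, 1, -1, -1],
              "B1": [1, -1, 1, -1], "B2": [1, -1, -1, 1]}


def decompose(coords, model):
    """Split coords into one component per irreducible representation."""
    images = []
    for op in OPS:
        # Use the model to find which atom lands on each site under this operation.
        _, source = linear_sum_assignment(cdist(model, model @ op.T))
        images.append((coords @ op.T)[source])
    # Projection operator: character-weighted average over the operations.
    return {irrep: sum(c * img for c, img in zip(chars, images)) / len(OPS)
            for irrep, chars in CHARACTERS.items()}


# Hypothetical aligned model and a slightly distorted query (angstrom).
model = np.array([[0.0, 0.0, 0.1], [0.75, 0.0, -0.5], [-0.75, 0.0, -0.5]])
distortion = np.array([[0.0, 0.0, 0.0], [0.05, 0.02, 0.0], [0.0, 0.0, -0.03]])
query = model + distortion

parts = decompose(query, model)

# Per-irrep value for the non-totally-symmetric parts (assumed definition).
scsd = {irrep: np.linalg.norm(vec, axis=1).sum()
        for irrep, vec in parts.items() if irrep != "A1"}

# Adding every component back together recovers the original positions.
rebuilt = sum(parts.values())
print(scsd)
print(np.allclose(rebuilt, query))
```

In this sketch the A1 component is the fully symmetrised structure and the remaining components are displacement vectors, which is why their sum reproduces the input coordinates.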

The reason that only the C_2v, C_2h, D_2h and D_4h are available is a question of comparative ability - even the D_4h moiety has two different E_g- and E_u- orientations (8 modes in total), related by a π/4 rotation around the principal axis. When dealing with 3-fold, 5-fold or multi-axis groups, the multiplicity becomes messy, so make sure you know what you're doing. Some of these labels are swapped around from where they're supposed to be.

The vector sets, values and orientations are all saved for interpretation in the multistructure mode. The principal component analysis tools which are included in the python package are quite powerful for how simple they are, but these can be augmented by using the other methodologies within sklearn, or with any other set. The results are generic pandas dataframes, which should provide for perfect inter-operability with python packages, and saving. In principle, we've found that the first PCA parameter is responsible for about 90% of asymmetric distortion along that symmetry in highly distorted materials, though this hasn't been blind-tested due to other tasks. Thus, for most arguments, we can say that the SCSD parameters can be used in basically the same manner as NSD parameters are used for porphyrins - e.g. correlating structure with photophysical properties. Applying these PCA parameters, we can start to cluster structural groups, and apply bidirectional interpretation of spectroscopic evidence.
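For readers unfamiliar with what "the first PCA parameter is responsible for" means in practice, here is a plain NumPy stand-in for the kind of decomposition that sklearn's PCA performs, run on synthetic data. The data are constructed so that one mode dominates, so the printed fraction says nothing about real structures; the function name pca and all array shapes are assumptions for the example.

```python
import numpy as np


def pca(displacements, n_components=2):
    """PCA by SVD. Rows are structures, columns are flattened displacements."""
    centred = displacements - displacements.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    ratio = s ** 2 / np.sum(s ** 2)                 # fraction of variance per component
    scores = u[:, :n_components] * s[:n_components]  # coordinates of each structure
    return vt[:n_components], scores, ratio[:n_components]


# Synthetic data set: 200 structures, 12 atoms x 3 coordinates each.
rng = np.random.default_rng(0)
n_structures, n_coords = 200, 36
main_mode = rng.normal(size=n_coords)
main_mode /= np.linalg.norm(main_mode)

amplitudes = rng.normal(scale=1.0, size=(n_structures, 1))  # one dominant distortion
noise = rng.normal(scale=0.05, size=(n_structures, n_coords))
data = amplitudes * main_mode + noise

components, scores, ratio = pca(data)
print("fraction of variance in first component:", ratio[0])
print("overlap with the planted mode:", abs(components[0] @ main_mode))
```

The scores array is what would then be used for clustering or for correlating against a measured property, one row per structure.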

Interpreting SCSD results:

The individual SCSD results are presented in three forms: a table, which shows the numeric values; a Mondrian diagram, which demonstrates the symmetry implications of these distortions; and an interactive 3D figure for understanding the movement. Additional tables are available in the 'raw data' box, and file exports are planned for future versions.

The simplest table shows the Mean and Sum values of the deviations, in Å, of the molecule along vectors aligned to each of the irreducible representations within the indicated symmetry group. These parameters, where equivalent (i.e. in D_2h) are dependent on the orientation of the molecule, so try to use a model for comparison between multiple data sets where appropriate.

The Mondrian diagram can be interpreted the same way as the porphyrin versions - which have been explained in (A paper, which I will link later). Essentially, the perfectly symmetric version is in the top right corner, and as you cross the boundaries, you "add" new deviations, and the symmetry might be reduced. This allows one to view the effect of only the biggest distortions (the ones that might be present in solution) on the symmetry, for interpreting NMR or spectroscopic evidence - this point group should be represented somewhere in the centre to upper-right quadrant. Going further down and left, you'll find the lowered PG symmetry present in the solid state (usually just that inherited from SG) or from restricted refinement, which could be different. This of course depends on the boundaries of the plot, so some elements can move in interpretation with large or small atom numbers - this is generally designed for small aromatic molecules, like bodipy or anthracene. Variable plot boundaries are one of those things I'd like to implement, but somewhat take away from the simplicity and immediate interpretability.

The Mondrian diagram can be considered like a decorative water fountain, flowing from high-symmetry to low-symmetry.

The 3D plot is relatively self-explanatory, just showing the different modes in 3D space, though there is some issue when the modes selected are all highly 2D - the plot zooms in automatically, and this is just a bug in plotly, I think. The PCA html interface (from a decomposed large data set) is a little easier to understand, given how the 2D and 3D modes are separated for planar-ideal molecules.

In future, a raw output will also be available, to allow for e.g. external plotting. (edit) this is in the expandable box at the base of the output.

Edit 9/7/21: There are some additions to the routine - automated integration of the principal components toolbox, comparative structure search and the "Atom posits sum" and "Atom posits data" in the interactive plot. The "Atom posits sum" and "Atom posits data" should be identical - if not, something has gone wrong in the analysis portion. Try "Basinhopping", cutting out non-essential atoms, or using a comparative minimally distorted model i.e. through kingsbury.id.au/scsd_2f . It is possible that an automatic assignment matrix is not going to work for very distorted structures, and a workaround is going to take a full rewrite, probably. The system will additionally balk at assigning different atoms to one another, so check that the atom names are the same as what they're supposed to be - i.e. a thiaporphyrin may not be recognised as a porphyrin, for example.

The comparative structures, when a database is provided, are the five closest structures, in terms of the principal components - these should be the nearest structures in real space, basically, though without chiral information (i.e. (+) and (-) will give the same abs(pca) values and will therefore flag as similar). These can be a guide to what structures are important to reference for a comparative analysis.
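The effect of comparing abs(pca) values can be shown with a small nearest-neighbour sketch. The database here is random numbers, and closest_structures, the score layout and the structure names are hypothetical; the point is only that a sign-flipped copy of an entry comes back as an exact match.

```python
import numpy as np


def closest_structures(query_scores, database_scores, names, k=5):
    """Return the k database entries nearest to the query in |PC| space."""
    difference = np.abs(database_scores) - np.abs(query_scores)
    distance = np.linalg.norm(difference, axis=1)
    order = np.argsort(distance)[:k]
    return [(names[i], float(distance[i])) for i in order]


# Hypothetical database: 50 structures described by 3 principal components.
rng = np.random.default_rng(1)
database_scores = rng.normal(size=(50, 3))
names = [f"structure_{i:02d}" for i in range(50)]

# A query that is the sign-flipped counterpart of entry 7.
query_scores = -database_scores[7]

hits = closest_structures(query_scores, database_scores, names)
for name, distance in hits:
    print(name, round(distance, 3))
```

Entry 7 is returned at zero distance even though every score has the opposite sign, which is the loss of chiral information described above.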

The interactive plot will not show bonds between atoms exceeding 1.7 Å as a general feature, and this may cut your model off, especially for metal complexes. The SCSD values should be correct, however - check the raw data output to see that e.g. the metal atom is included. Using a comparative model allows us to tweak this value (through dist_dict in scsd_models).
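To see why a fixed cutoff hides metal-ligand bonds, consider this small distance-based sketch. The function bonded_pairs, its cutoff argument and the coordinates are hypothetical and are not the plotting code or the dist_dict format, neither of which is shown here.

```python
import numpy as np


def bonded_pairs(coords, cutoff=1.7):
    """List index pairs (i, j), i < j, whose separation is within the cutoff."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i, j = np.where(np.triu(dist <= cutoff, k=1))  # upper triangle: each pair once
    return list(zip(i.tolist(), j.tolist()))


# Hypothetical fragment: a metal at the origin, a donor atom 2.0 angstrom away,
# and a carbon bonded to the donor at 1.35 angstrom.
coords = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.35, 0.0, 0.0]])

default_bonds = bonded_pairs(coords)             # metal-donor contact is dropped
extended_bonds = bonded_pairs(coords, cutoff=2.1)  # metal-donor contact is drawn
print(default_bonds, extended_bonds)
```

Only the drawing is affected by the cutoff in this picture; the atom list itself is unchanged, which matches the advice to check the raw data output for the metal atom.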

Edit 19/7/21: Added support for D3h, D5h and D6h - now covering most fluorophores, nanostructures and components at their highest symmetry. This is still a manual trial-and-error process making sure the mode assignments are right, but there are signs of progress. The E-modes in Dxh, similar to D4h, sum to larger than the equivalent atom structure, so are multiplied by the formula (2/number_of_modes) to arrive at the correct positions. As a result, the symmetry may be higher than that of the sum of non-orthogonal modes in e.g. D7h, but I couldn't find a general procedure for handling this. A validation mode was added (Atom Posits Sum and Atom Posits Data, should be identical) to check that the results were calculated correctly - this automatic algorithm has its limits and gets confused sometimes when things are really distorted. I'm working on it.
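One way to see where a factor like (2/number_of_modes) can come from is a toy 2D picture: if a vector is projected onto several equally spaced, non-orthogonal directions and the projections are added up, the total overshoots the vector by number_of_modes/2. This is an assumption about the origin of the overcounting, offered as an illustration rather than a description of the SCSD code, and summed_projections is a made-up name.

```python
import numpy as np


def summed_projections(vector, n_modes):
    """Project a 2D vector onto n_modes directions spread over pi, then sum."""
    angles = np.arange(n_modes) * np.pi / n_modes
    directions = np.column_stack([np.cos(angles), np.sin(angles)])
    # sum over k of (vector . u_k) * u_k
    return (directions @ vector) @ directions


vector = np.array([0.3, -0.8])

# Two directions form an ordinary orthogonal basis; more than two overcount.
raw = {n: summed_projections(vector, n) for n in (2, 3, 4, 5, 6)}
scaled = {n: (2 / n) * total for n, total in raw.items()}

for n in raw:
    print(n, raw[n], scaled[n])
```

With four directions (two orthogonal pairs related by a π/4 rotation, as in the D4h case described earlier) the raw sum is twice the original vector, and the 2/4 factor brings it back.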

Some models now have the ability to extend the standard bond distance, and I've fixed a bug (thanks Charles) in yield_model.

Edit 30/9/21: Version 3 is online, and includes automatic assignment matrix generation, which improves speed and reliability with edge cases significantly. The core is rewritten in an object-oriented fashion to allow for simplification of data and model handling, to do away with endless dictionaries in the scsd_symmetry module. This code will be publicly available shortly.

The big change is the databases available - the scsd database is now searchable in its entirety. If you wish to investigate multiple structures, this data can be sent through - always happy to collaborate!

Please contact me regarding implementation of new symmetry groups, models, principal components, multicomponent models or if you'd like to give me money to do any of the above. My email address is ckingsbu@tcd.ie