Skip to content

Data reduction

Ti-Yen Lan edited this page Jun 19, 2018 · 22 revisions

The data reduction step can be divided into the following processes:

Here we generate the necessary files for the EMC reconstruction. For the test dataset, this step can be skipped by directly using the reduced data provided by us, which will be described in Skipping data reduction.

Mapping detector pixels

We start by updating the experimental parameters in the [make-detector] section of config.ini:

[make-detector]
# pixel
num_row = 2527
num_col = 2463
cx = 1285.5
cy = 1262.0
Rstop = 115.0

# meter
detd = 0.45
px = 172e-6

# angstrom
wl = 1.03324
res_max = 1.95

# beam incidence direction
sx = 0.005
sy = -0.01
sz = -1

The parameters num_row and num_col specify the detector size in pixels. The pixels are labeled by coordinates (x,y), with x = 0, 1,..., num_row−1 and y = 0, 1,..., num_col−1. With this choice of coordinates, the upper-left detector pixel has coordinates (0,0), and the X-ray beam is incident in the -z direction. We assume the X-ray polarization is in the y direction, because the main application of our program is on the analysis of SMX data taken at storage ring synchrotron sources. The parameters (cx, cy) label the beam incidence point on the detector, and Rstop is the beamstop radius in pixels. The other parameters include detd, the sample-to-detector distance, px, the squared detector pixel size, wl, the incident X-ray wavelength, and res_max, the maximum resolution of the pixels that will be considered in the reconstruction. The vector (sx,sy,sz) indicates the beam incidence direction (does not have to be normalized), and is typically set as (0, 0, −1).

After updating the parameters, we move to the directory make-detector and execute the command

python make-mask.py [path to frame] > run.log

to generate the file mask.dat in the directory aux to exclude the detector gaps and the pixels shadowed by the beamstop holder. Here [path to frame] is the path to one of the data frames in the cbf format. You should expect to see a masked data frame that looks like:



The beamstop region will be masked out in the files that record the mapping of the detector pixels to reciprocal space, which are obtained by executing the commands:

gcc make-detector.c -O3 -lm -o det
./det ../config.ini >> run.log

Background estimation and peak finding

After moving to the directory make-background, we generate the lists of the filenames associated with each data frame using the command:

python make-filelists.py [raw-data-dir]

Here [raw-data-dir] is the path to the directory that contains the cbf files downloaded from CXIDB.

Then we update the parameters in the [make-background] section in config.ini:

[make-background]
num_raw_data = 79992
hot_pix_thres = 1e4
qlen = 500

The execution of make-filelists.py has automatically updated the value of num_raw_data, the total number of data frames. The parameter hot_pix_thres is the threshold value beyond which a pixel is identified as defective and masked out. In our analysis, we assume that the background scatter in each data frame is azimuthally symmetric about the incident X-ray beam, and qlen represents the number of bins that divide the spatial frequency magnitudes with equal spacing for the background estimation. Finally, we execute the commands:

make
mpirun -np [nproc] ./ave_bg ../config.ini > run.log &

to estimate the pixel-wise background values and identify the outlier pixels in each frame, where [nproc] is the number of processors used in the parallel processing.

Lattice parameter estimation

Next, we move to the directory make-powder to estimate the lattice parameters. The parameters in the [make-powder] section in config.ini are:

[make-powder]
min_patch_sz = 2
max_patch_sz = 10
min_num_peak = 3
max_num_peak = 20

A Bragg peak candidate is assumed to contain at least min_patch_sz but no more than max_patch_sz contiguous outlier pixels identified from the diffuse background scatter. Only the data frames with at least min_num_peak but no more than max_num_peak candidate peaks are kept for the later analysis. The enforcement of data sparsity can be removed by making max_num_peak a large integer.

By executing the commands

gcc make-powder.c -O3 -lm -o powder
./powder ../config.ini > run.log

we generate the files frame-peak-count.dat, patch-sz-count.dat, 1d-pseudo-powder.dat and 2d-pseudo-powder.dat. The number of candidate peaks in each data frame is recorded in frame-peak-count.dat. The file patch-sz-count.dat represents the histogram of the size of contiguous outlier pixels, from which we can check if the original choice of max_patch_sz is reasonable. The file 1d-pseudo-powder.dat contains three columns: the spatial frequency magnitudes, the counts of inter-peak distances in reciprocal space in each frame, and the counts of spatial frequency magnitudes of the candidate peaks. Finally, the file 2d-pseudo-powder.dat records the maximum photon count in each detector pixel.

In the analysis of the test dataset, we fit the lattice parameters a = 79.1 Å and c = 38.4 Å by assuming a primitive tetragonal lattice. This choice can be assessed by executing the command

python plot-1d-powder.py

to plot the histograms of the inter-peak distances and the spatial frequency magnitudes of the candidate peaks. For general crystal lattices, the lattice parameters have to be estimated by fitting the histogram of the inter-peak distances. By executing the command

python plot-2d-powder.py

we plot the 2D pseudo-powder pattern to check if the original estimates of the parameters (cx, cy), the beam incidence point on the detector, and (sx, sy, sz), the beam incidence direction, are reasonable. The whole data processing from Mapping detector pixels to here should be rerun if these parameters have to be changed. The values of the estimated lattice parameters are stored as a 3×3 matrix

u[0] v[0] w[0]
u[1] v[1] w[1]
u[2] v[2] w[2]

in the file basis-vec.dat in the directory aux, where u, v and w denote the basis vectors of the primitive unit cell in units of Å. This file should be created by the user for general crystal lattices.

Finding probable orientations

Here we narrow down the number of probable orientations for each frame by directly rotating the centroids of the candidate peaks over all rotation samples in reciprocal space. An orientation is kept for a particular frame if at least min_num_peak candidate peaks overlap with the predicted Bragg peaks. In the [orient-peak] section in config.ini we have the parameters

[orient-peak]
res_cutoff = 4.0
VN = 15
gw = 2.0

The parameter res_cutoff specifies the highest resolution of data in unit of Å that will be used in determining the probable orientations and the low-resolution EMC reconstruction. The parameter VN denotes the number of voxels between the closest Bragg peaks in reciprocal space, and gw defines the radius of a Bragg peak in unit of voxel.

After updating the parameters, we move to the directory make-quaternion to generate the rotation samples with the command

python make-rot-samples.py [num_div]

The integer [num_div] specifies the angular resolution by δθ = 0.944/[num_div]. An angular resolution of at least 2gw·res_cutoff·min_rcell/VN is required in order to not to miss any Bragg peaks. Here min_rcell denotes the minimum peak distance in reciprocal space, with unit of 1/Å. For the test dataset, we have min_rcell = 1/a, and the resulting angular resolution corresponds to [num_div] = 70. This command creates the file c-quaternion[num_div].bin in the directory aux.

Finally, we move to the directory orient-peak, and execute the commands

mpicc mpi-sync-orient-peak.c -O3 -lm -o orient
mpirun -np [nproc] ./orient ../config.ini > run.log &

to find the probable orientations for each frame, where [nproc] is the number of processors to be used. The output file num_prob_orien.dat records the number of probable orientations for each data frame.

Data conversion

We then move to the directory reduce-data to convert data to the format that will be used by the EMC reconstruction. In the [reduce-data] section of config.ini, we have the parameters:

[orient-peak]
nproc = 20
mpi_bgfile = [reduced-data-dir]/Data/mpi-bg_model.bin
mpi_datafile = [reduced-data-dir]/Data/mpi-datafile.bin

Here nproc is the number of processors that will be used for the EMC reconstruction, and [reduced-data-dir] is the directory we used in Initialization. The files mpi_bgfile and mpi_datafile store the background estimates and photon counts of the data frames that will be input to the EMC algorithm, respectively. In order to reduce the time spent on reading data, the photon counts are stored as short integers. The frames with more than max_num_peak identified peaks or no probable orientations found in Finding probable orientations will be excluded.

We generate mpi_bgfile and other auxiliary files by executing the commands

gcc reduce-data.c -O3 -lm -o reduce-data
./reduce-data ../config.ini > run.log

The order of the frames is rearranged to balance the work loads between the nproc processors based on the number of probable orientations per frame, and this information is stored in the file reduced-data_id.dat. The file mpi_datafile is generated using the commands

make
./wr-data ../config.ini >> run.log &

Expansion matrix calculation

Since the crystal diffraction signals are concentrated in the Bragg spots, we can speed up the expand (E) step of the EMC reconstruction by precalculating a look-up table that records the mapping between the Bragg peaks and the detector pixels over all rotation samples. This can be done by moving to the directory make-Ematrix and executing the commands

mpicc mpi-make-Emat.c -O3 -lm -o emat
mpirun -np [nproc] ./emat -low ../config.ini > run.log &

Here [nproc] is the number of processors to be used. The mapping is stored in the files r2peak_file and peak2r_file, whose locations are specified in the [make-Ematrix] section of config.ini.

Skipping data reduction

For those who would like to try an EMC reconstruction immediately, we have provided the reduced data generated from the test dataset following the data processing procedures described above. After creating the directory skip-data-reduction, the reduced data can be downloaded from the website

http://cxidb.org/data/82/reduced-data

The downloaded files are moved to their appropriate locations by executing the command

python distribute-files.py [work-dir] > dist.log &

to get ready for the EMC reconstruction. Here [work-dir] denotes the path to the working directory, EMC-for-SMX.