Fufeng WANG, Tianhang MENG, Zhongxi NING, Ximing ZHU. Influence of cyclic ignition and steady-state operation on a 1–2 A barium tungsten hollow cathode[J]. Plasma Science and Technology, 2024, 26(12): 125503. DOI: 10.1088/2058-6272/ad7a57

Influence of cyclic ignition and steady-state operation on a 1–2 A barium tungsten hollow cathode

  • Booming low-power electric propulsion systems require 1–2 A hollow cathodes. Such cathodes are expected to undergo more frequent ignitions in low orbits, but the impact of cyclic ignition on 1–2 A barium tungsten hollow cathodes with a heater was previously unclear. In this study, a 12,638-cycle ignition test and a 6,000-hour steady-state life test were carried out on two identical cathodes. Both the discharge voltage and the orifice erosion of the cathode after cyclic ignition were larger than those of the cathode after steady-state operation, indicating that cyclic ignition degrades the discharge performance of a low-current BaO-W cathode with a heater more than steady-state operation does. Ion energy distribution functions measured during the ignition period indicated that ion bombardment was the main cause of the orifice expansion. The number of ignitions therefore needs to be taken into account when assessing the lifetime of this kind of cathode.

  • A tokamak is designed to confine high-pressure plasma using a strong magnetic field, with the ultimate goal of creating conditions for net energy output. However, this also makes it vulnerable to various instabilities and events. Disruption is one kind of severe event in which magnetic confinement in the tokamak is suddenly lost [1]. The resulting release of thermal energy and electromagnetic forces can cause significant damage to the plasma-facing components of larger tokamak devices [2], such as the International Thermonuclear Experimental Reactor (ITER) [3]. Therefore, it is crucial for future reactor devices to mitigate the effects of disruptions or avoid them altogether.

    Several methods have been explored to mitigate the effects of disruption, including massive gas puffing [4], pellet injection [4, 5] and so on. To trigger the mitigation system in time and avoid unnecessary termination, a real-time risk assessment of impending disruption is essential. One approach is to make predictions with theoretical models based on first principles, since the physical causes of disruptions have been studied. However, the complexity of these causes and the non-linearity of disruption evolution still make it impracticable to give a quantitative prediction of an impending disruption from a first-principles approach.

    The absence of a comprehensive theoretical model based on first principles has led to another data-driven approach, namely neural networks, which have already been overwhelmingly successful in computer vision [6] and natural language processing [7]. Consequently, data-driven neural networks have been widely used for prediction of disruption in many tokamaks, such as JET [8, 9], ASDEX [10, 11], DIII-D [12, 13], Alcator C-mod [14], JT-60U [15, 16], EAST [17, 18] and J-TEXT [19, 20], and the immediacy and precision of neural networks for disruption prediction have been validated.

    The initial employment of a multi-layer perceptron (MLP) [21] neural network for disruption prediction on the DIII-D tokamak marked a significant milestone [12]. Subsequently, convolutional neural networks (CNNs), which excel at extracting spatial features such as temperature and density profiles [13, 17], and recurrent neural networks (RNNs) and their variants, well-suited for processing time-series signals [18, 22, 23], have been widely utilized for disruption prediction in many tokamaks. In recent research, substantial attention has been directed towards handling multimodal and high-dimensional inputs, as well as state-of-the-art models. However, labels and annotation methods, despite their paramount importance, have been largely overlooked.

    The remarkable predictive capabilities of neural networks are critically reliant on the quality of labels. Accurate ground truth labeling is essential to ensure the effectiveness and reliability of the predictive model, as it serves as the foundation for training. When a model is trained on data with inaccurate labels, it learns the wrong patterns present in the incorrect data distribution, and it develops unrealistic expectations about its own performance due to the influence of mislabeled data. Together, these issues can lead to overfitting, where a model performs well on the training dataset but fails to generalize to the test dataset, and overconfidence, where a model yields overly confident yet incorrect predictions, resulting in poor performance on unseen or test datasets.

    In disruption prediction, it is challenging to accurately evaluate the likelihood or risk of a disruption at any given moment during a discharge. In classical annotation methods, a naïve assumption has been used to simplify the annotation process, which posits that the value of risk assigned to a given moment is determined solely by its proximity to disruption. In other words, the closer a moment is to disruption, the higher the assigned risk. Typically, a step function is utilized that assumes the same threshold time for all disruptive discharges, irrespective of the underlying causes [10, 13, 18, 22–24]. Once a given time instant has passed this threshold time, it is assigned a 100% probability of disruption. However, this naïve assumption neglects the complexity of the causes of disruption and the effects of various plasma parameters on disruption, leading to overfitting and overconfidence.

    These challenges have been recognized in previous research, and a generative topographic mapping method has been utilized on JET to address the issue caused by the classical annotation method [25, 26]. The issues of overconfidence and overfitting become even more problematic when classical labeling is applied across different machines to transfer the knowledge from existing machines to ITER. Hence, there is an urgent need to develop accurate ground truth labeling that can reflect the objective likelihood of a disruption at a given time instant, taking into consideration the current plasma parameters.

    An improved training framework is proposed in this paper to overcome the problem of overfitting and overconfidence introduced by inaccurate labeling, thereby improving the accuracy of the model for the prediction of disruption. This new framework is inspired by the method of knowledge distillation [27], and consists of a two-stage training process. In the first stage, a simple but efficient MLP neural network is trained to estimate the risk of an impending disruption using the classical annotation method as the teacher model. The teacher model can provide a more reliable estimate of the risk, and its outputs serve as the basis for the second stage of training. In the second stage, the outputs given by the teacher model are corrected by classical annotation methods and used as the new target to supervise the training of the student model. This allows the student model to overcome the problem of overfitting and overconfidence by learning from the teacher model while also taking into account the corrected labeling provided by the classical method.

    The outline for the rest of this paper is as follows. Section 2 introduces the disruption database in the EXL-50 tokamak and the MLP model used. Section 3 presents the details of the disruption predictions trained by the MLP neural network, and provides a preliminary analysis of the predictive results. In section 4, we propose a new training framework, and compare its predictive performance with that of classical annotation methods. Finally, section 5 summarizes the advantage of the new framework, and discusses its potential implications.

    The EXL-50 device is a medium-sized spherical tokamak constructed in China with a fully non-inductive current drive [28]. It has been in operation since the middle of 2019. To date, several campaigns of experiments have been conducted on EXL-50, and it is able to routinely run discharges with plasma current exceeding 100 kA and current flattop periods surpassing 2 s [29]. Despite its impressive operational performance, disruptions have occasionally been observed during these campaigns, providing a large database for disruption prediction research.

    In order to estimate the risk of impending disruption in advance, only real-time measured diagnostics are utilized in this work. The signals used for the predictive model are presented in table 1, along with their mean value, standard deviation and sampling rate. Considering the availability of diagnostics in the newly built device and the need for real-time capability, ten signals were used as input features to predict disruptions; most are relevant to the underlying causes of disruptions, while the others are global quantities. The sample rates of the different signals varied considerably, ranging from 330 Hz to 1 MHz, and a uniform sample rate was required for the training process. A sample rate of 1 kHz was used for the input signals as a trade-off between promptness of alarming and calculation efficiency, and was also commonly used in previous work [17, 22]. Since the ten signals were not sampled simultaneously, cubic spline interpolation was used to resample the data to 1 kHz. The input data of the ith signal were standardized by the following equation:

    Table 1. Ten signals used as the input features of the model.

    Quantity | Symbol | Mean | Standard deviation | Sample rate
    Radiation at the core (a.u.) | AXUV015 | 0.21 | 0.52 | 100 kHz
    Radiation at the edge (a.u.) | AXUV027 | 1.7 | 3.44 | 100 kHz
    Horizontal field (T) | BR | 0.0002 | 0.0002 | 100 kHz
    Vertical field (T) | BV | 0.03 | 0.013 | 100 kHz
    Horizontal displacement (a.u.) | PH | −0.22 | 0.234 | 1 kHz
    Vertical displacement (a.u.) | PV | −0.02 | 0.1 | 1 kHz
    Plasma density (10^17 m^−2) | ne | 9.2 | 6.5 | 1 MHz
    Plasma current (kA) | IP | 78.5 | 39.5 | 100 kHz
    Loop voltage (V) | LoopV | −0.04 | 0.12 | 330 Hz
    Gas pressure (a.u.) | GAS_P | 2.27 | 1.1 | 10 kHz
    $x_i^{\ast} = \dfrac{x_i - \mu_i}{\sigma_i}$, (1)

    using the mean value $\mu_i$ and standard deviation $\sigma_i$ to ensure the input data have zero mean and unit variance, which can improve the efficiency and effectiveness of the training process.
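
    As an illustration of this preprocessing step, the following Python sketch resamples one diagnostic onto a uniform 1 kHz time base with cubic spline interpolation and applies the standardization of equation (1). The function and variable names are illustrative assumptions, not the authors' code.

        import numpy as np
        from scipy.interpolate import CubicSpline

        def resample_and_standardize(t_raw, x_raw, t_uniform, mean, std):
            """Resample one diagnostic to a uniform time base and z-score it."""
            spline = CubicSpline(t_raw, x_raw)      # cubic spline interpolation
            x_uniform = spline(t_uniform)           # evaluate on the 1 kHz grid
            return (x_uniform - mean) / std         # equation (1)

        # usage: a 100 kHz stand-in trace resampled to 1 kHz
        t_raw = np.arange(0.0, 2.0, 1e-5)           # 100 kHz samples over 2 s
        x_raw = np.sin(2 * np.pi * 5 * t_raw)       # placeholder diagnostic signal
        t_1khz = np.arange(0.0, 2.0, 1e-3)          # uniform 1 kHz time base
        x_std = resample_and_standardize(t_raw, x_raw, t_1khz,
                                         x_raw.mean(), x_raw.std())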

    In this study, a discrete time instant approach rather than a time-series approach was adopted to predict the risk of disruption, based on the simplified assumption that the risk is mostly determined by the current plasma parameters and operational state at a specific time instant. However, we should bear in mind that this simplified assumption ignores the temporal evolution of features in time-series data, which could potentially be captured by an RNN model. To train the predictive model, an efficient MLP neural network was chosen. We define the time instant of the maximum of dI_p/dt (the time derivative of the plasma current) as the moment at which the plasma current begins to quench. The primary types of disruption in EXL-50 are vertical displacement event (VDE) disruption and impurity radiation disruption. Disruptions caused by VDEs are characterized by a significant change in vertical displacement occurring several milliseconds to tens of milliseconds before the disruption. In this database, we have employed a global estimate of 300 ms before current quench in the neural network to identify precursors or major signal excursions in the pre-disruptive stage.

    The disruption database utilized in this study includes 2568 discharges, of which 1185 are disruptive while the rest are non-disruptive. To provide a more robust evaluation of predictive performance than a single train–test split, we utilize a six-fold cross-validation procedure. We use discharge-wise random division to split the dataset into an initial training dataset and a test dataset, and further randomly split the initial training dataset into K = 6 subsets. The models are then trained on each combination of five subsets, which serve as the actual training dataset to determine the values of the weights and biases, while the corresponding held-out subset is used as the validation dataset to choose the best predictive model. In each training run, the test dataset remains fixed and is consistently used to provide an unbiased evaluation of the model's performance.

    We label the six dataset combinations K1 to K6; these are used in the six-fold cross-validation procedure. Specifically, dataset combination K1 is chosen as the reference combination, and its distribution of disruptive and non-disruptive discharges within the training, validation and test datasets is shown in table 2 (which is used as a reference for further study). It is important to note that only the test dataset remains constant, while the training and validation datasets vary for the other five dataset combinations. In the training process, the 545,400 samples from the 1818 discharges in the training dataset of combination K1 are randomly shuffled to reduce variance and accelerate convergence.
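
    A minimal sketch of this discharge-wise split, assuming scikit-learn and hypothetical integer shot IDs (the subset sizes follow table 2; this is not the authors' exact code):

        import numpy as np
        from sklearn.model_selection import KFold, train_test_split

        # The split is discharge-wise, so all samples of one discharge stay
        # in the same subset.
        shot_ids = np.arange(2568)
        train_val_ids, test_ids = train_test_split(
            shot_ids, test_size=386, shuffle=True, random_state=0)

        kfold = KFold(n_splits=6, shuffle=True, random_state=0)
        for fold, (tr_idx, va_idx) in enumerate(kfold.split(train_val_ids)):
            train_shots = train_val_ids[tr_idx]   # five subsets for training
            val_shots = train_val_ids[va_idx]     # held-out validation subset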

    Table 2. The distribution of disruptive and non-disruptive discharges in the training, validation and test datasets for dataset combination K1.

    Category | Train | Validation | Test | Total
    Disruption | 833 | 185 | 167 | 1185
    Non-disruption | 985 | 179 | 219 | 1383
    Total | 1818 | 364 | 386 | 2568

    The MLP is a traditional fully connected neural network with multiple hidden layers, in which the neurons are fully connected with neurons in the adjacent layers.

    Each hidden layer in the MLP functions as a combination of all the neurons in that layer. To some extent, the neurons in a hidden layer act as filters that extract features useful for disruption identification, such as precursors or major signal excursions. By stacking several layers, these key features can be automatically extracted from the raw input data, and the last hidden layer compiles the features from previous layers to yield the output.

    In a classification task, the model applies the softmax activation function to the logit vector of the penultimate layer to compute the probabilities of the different classes, also referred to as the soft label or soft output. For example, an output of [0.1, 0.9] indicates that a time instant has a 90% probability of being near the disruption and a 10% probability of being far away from it, and a decision threshold is used to map the soft output of the binary classification to a binary category. The hard label [1, 0] or [0, 1] is the one-hot encoding of the binary label predefined by experts, indicating a 0% or 100% probability of being disruptive, respectively. The cross-entropy loss function is used to measure the difference between the soft output and the hard label

    $\mathrm{Entropy} = -\sum_{i=1}^{n} p(x_i)\log\,q(x_i)$, (2)

    where p is the hard label and q is the probability, or soft label. A backpropagation algorithm is utilized to update the weights in each layer based on the error, making the output match the targets. In this sense, the weights of neurons in each hidden layer, i.e. coefficients of the filters, are ‘learned’ from the large dataset. That is why we call the technique data driven. The MLP model architecture used in this study and the hyper-parameters are presented in table 3.

    Table 3. The hyper-parameters used in the MLP model.

    Parameter | Value
    Number of dense layers | 3
    Number of neurons in Dense 1 | 128
    Number of neurons in Dense 2 | 64
    Number of neurons in Dense 3 | 32
    Batch size | 512
    Learning rate | 0.001
    Activation function | ReLU
    Optimizer | Adam optimizer
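
    A sketch of this architecture in PyTorch, using the layer sizes and hyper-parameters of table 3; the choice of PyTorch, the input width of ten signals and the two-class output are assumptions consistent with the text, not the authors' implementation:

        import torch
        import torch.nn as nn

        # Three hidden dense layers (128/64/32, ReLU) feeding a 2-class output.
        model = nn.Sequential(
            nn.Linear(10, 128), nn.ReLU(),   # Dense 1
            nn.Linear(128, 64), nn.ReLU(),   # Dense 2
            nn.Linear(64, 32), nn.ReLU(),    # Dense 3
            nn.Linear(32, 2),                # logits for the two classes
        )
        # CrossEntropyLoss applies the softmax internally (cf. equation (2)).
        loss_fn = nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

        x = torch.randn(512, 10)           # one batch of 512 samples
        y = torch.randint(0, 2, (512,))    # hard labels: 0 stable, 1 pre-disruptive
        loss = loss_fn(model(x), y)        # forward pass and loss
        optimizer.zero_grad()
        loss.backward()                    # backpropagation
        optimizer.step()                   # weight update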

    As a supervised learning algorithm, each sample in the database must be labeled with a target value by expert annotation. In classical annotation methods for disruption prediction, the target value of disruption risk at a given time instant in disrupted discharges is determined by its proximity to disruption, rather than by physical quantities at that instant. This approach relies on a naïve assumption, and a step function is typically used to label the target value in a disrupted discharge [10, 13, 18, 22–24], as shown in equation (3)

    $y(t)=\begin{cases}0, & t_{\mathrm{dis}}-t > t_{\mathrm{threshold}}\\ 1, & t_{\mathrm{dis}}-t < t_{\mathrm{threshold}},\end{cases}$ (3)

    where $t_{\mathrm{dis}}$ is the time instant of disruption, $t$ is the time instant of the target and $t_{\mathrm{threshold}}$ is the warning time. A binary output (0 or 1) is used to indicate whether a given time instant corresponds to the stable stage or the pre-disruptive stage. In non-disruptive discharges, the network target at every time instant is labelled 0. Although the value of $t_{\mathrm{threshold}}$ should ideally be adjusted depending on the type of disruption to achieve more accurate labeling, the classical annotation method simply sets a uniform threshold value for all disrupted discharges, chosen to maximize the performance of the predictive model. To enable further comparison with the predictive results trained by the new training framework, we used the classical annotation method to supervise the training, and selected a time threshold of 150 ms in advance of the disruption for all disruptive discharges.
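
    A sketch of this labeling scheme for one disruptive discharge; locating the quench via the maximum of dI_p/dt follows the definition given earlier in the text, while the function and argument names are illustrative assumptions:

        import numpy as np

        def classical_labels(t, ip, disruptive, t_threshold=0.150):
            """Step-function labels of equation (3) for one discharge.

            t           uniform 1 kHz time base (s), assumed to end at the quench
            ip          plasma current trace on the same time base
            disruptive  True if the discharge ends in a disruption
            """
            y = np.zeros_like(t)
            if disruptive:
                # quench time: maximum of the time derivative of Ip (see text)
                t_dis = t[np.argmax(np.gradient(ip, t))]
                y[(t_dis - t) < t_threshold] = 1   # pre-disruptive stage
            return y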

    Evaluation of the performance of a predictive model in disruption prediction differs from that of a typical classification task. In a typical classification task, the performance is evaluated on an instance-by-instance basis, where all instances are considered independently. In the case of disruption prediction, however, evaluation of the predictive model is determined on a shot-by-shot basis, where the instances within a single discharge are treated as a unit to evaluate the accuracy of prediction. For disrupted discharges, a true positive is recorded when the model’s output exceeds the pre-set threshold value at any time instant before the warning time, while a false negative is recorded otherwise. In this work, the warning time is set as 30 ms before the disruption, which is also the minimum time needed to successfully trigger the mitigation system in future devices [30]. In non-disrupted discharges, a true negative is recorded if the model’s output remains below the threshold before the warning time, indicating that the alarm is not triggered, or a false positive is recorded otherwise.

    Two operating metrics, the true positive rate (TPR, also known as the recall rate) and the false positive rate (FPR), are commonly used to evaluate the performance of the predictive model at different threshold settings

    $\mathrm{true\ positive\ rate} = \dfrac{\mathrm{true\ positives}}{\mathrm{true\ positives} + \mathrm{false\ negatives}}$, (4)
    $\mathrm{false\ positive\ rate} = \dfrac{\mathrm{false\ positives}}{\mathrm{false\ positives} + \mathrm{true\ negatives}}$. (5)

    TPR is defined as the ratio of true positives to all disrupted discharges in the dataset, while FPR is defined as the ratio of false positives to all non-disrupted discharges in the dataset. Lowering the threshold for triggering an alarm allows more disruptions to be detected in advance, but also increases the risk of false alarms. This trade-off between TPR and FPR can be visualized by plotting the TPR against the FPR at various threshold settings, resulting in a receiver operating characteristic (ROC) curve. The two-dimensional area under the ROC curve (AUC) is a popular metric for evaluating the quality of a predictive model irrespective of the chosen threshold. A value of 1 describes a perfect model, while a value of 0.5 indicates a random classifier. The optimal point on the ROC curve is (0, 1), where 100% TPR is achieved with 0% FPR.
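
    The shot-by-shot evaluation described above can be sketched as follows; per-discharge scores are assumed to be the maximum model output before the warning time, and all names are illustrative:

        import numpy as np

        def shot_level_roc(scores, labels, thresholds):
            """Shot-by-shot ROC points under the evaluation rules above.

            scores  per-discharge maximum model output before the warning time
            labels  1 for disrupted discharges, 0 for non-disrupted ones
            """
            scores, labels = np.asarray(scores), np.asarray(labels)
            tpr, fpr = [], []
            for th in thresholds:
                alarm = scores > th                    # alarm raised for the shot
                tp = np.sum(alarm & (labels == 1))
                fp = np.sum(alarm & (labels == 0))
                tpr.append(tp / np.sum(labels == 1))   # equation (4)
                fpr.append(fp / np.sum(labels == 0))   # equation (5)
            return np.array(fpr), np.array(tpr)

        # usage with three toy discharges
        fpr, tpr = shot_level_roc([0.9, 0.2, 0.7], [1, 0, 1],
                                  np.linspace(0, 1, 101))
        order = np.argsort(fpr)
        auc = np.trapz(tpr[order], fpr[order])         # area under the ROC curve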

    Additionally, precision, defined as the ratio of true positives to all the discharges predicted as disrupted, is another important metric to consider

    $\mathrm{precision} = \dfrac{\mathrm{true\ positives}}{\mathrm{true\ positives} + \mathrm{false\ positives}}$. (6)

    Precision and recall (also known as TPR) are two metrics that need to be traded off against each other. The optimal threshold can be determined by optimizing the $F_\beta$ score, which takes both precision and recall into account

    $F_\beta = (1+\beta^2)\,\dfrac{\mathrm{recall} \times \mathrm{precision}}{\mathrm{recall} + \beta^2 \times \mathrm{precision}}$, (7)

    where β is a real factor chosen such that recall is considered β times as important as precision. In this paper, the F0.5 score is used, which gives more weight to precision than to recall.
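
    A sketch of threshold selection by maximizing the F0.5 score over shot-level predictions; the names and the threshold grid are illustrative assumptions:

        import numpy as np

        def f_beta(precision, recall, beta=0.5):
            """Equation (7); beta = 0.5 weights precision more than recall."""
            return ((1 + beta**2) * recall * precision
                    / (recall + beta**2 * precision))

        def best_threshold(scores, labels, thresholds):
            """Sweep candidate thresholds and keep the one maximizing F0.5."""
            scores, labels = np.asarray(scores), np.asarray(labels)
            best_th, best_f = None, -1.0
            for th in thresholds:
                alarm = scores > th
                tp = np.sum(alarm & (labels == 1))
                fp = np.sum(alarm & (labels == 0))
                fn = np.sum(~alarm & (labels == 1))
                if tp == 0:
                    continue                      # skip degenerate points
                p, r = tp / (tp + fp), tp / (tp + fn)
                if f_beta(p, r) > best_f:
                    best_th, best_f = th, f_beta(p, r)
            return best_th, best_f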

    We label the first model trained on dataset combination K1 as model T1; the cross-entropy losses of the training and validation datasets for model T1 versus training epochs are shown in figure 1(a). The cross-entropy loss of the training dataset continuously decreases over the first 50 epochs, after which it gradually converges. In contrast, the cross-entropy loss of the validation dataset decreases to its smallest value around epoch 15, but undergoes a rapid increase in subsequent epochs. This phenomenon, known as overfitting, occurs when the model classifies the training dataset better than the test dataset. The discrepancy arises from a substantial number of instances in which the plasma remained stable but was mislabeled as pre-disruptive, or conversely where a disruption approached but the instant was erroneously labeled as stable. Such inaccurate labels can have detrimental effects on the performance of the predictive model, as the model learns the wrong patterns present in the mislabeled data, leading to issues of overfitting and overconfidence. It is notable that, in this scenario, overfitting cannot be effectively prevented by early termination of training: the AUC value of the validation dataset achieved its maximum value of 0.940 at epoch 41, while the AUC value of the test dataset began to show signs of overfitting after epoch 27.

    Figure  1.  (a) Cross-entropy loss of the training and validation datasets, along with the AUC value of the validation and test datasets versus training epochs of model T1 with the classical annotation method. (b) Average AUC value with ±1 standard deviation intervals among the six-fold validation dataset and the test dataset for 12 trained models with the classical annotation method. The dashed line indicates epoch 26, at which the average AUC value of the validation dataset achieved its maximum value of 0.937 with a standard deviation of 0.013. Additionally, the average AUC value of the test dataset at this epoch is 0.934, with a deviation of 0.003.

    To minimize the potential errors introduced by individual training instances, we conducted additional training runs on dataset combination K1. This involved initializing the weight matrices with different random seeds, leading to the creation of model T2. Similarly, we obtained 12 models, labeled T1 to T12, across dataset combinations K1 to K6 using the same approach. Figure 1(b) illustrates the average AUC value and standard deviation intervals computed during the training process of models T1 to T12; the dashed line indicates that the average AUC value of the validation dataset achieved its maximum value of 0.937, with a standard deviation of 0.013, at epoch 26. Meanwhile, the average AUC value of the test dataset also achieved its maximum value of 0.934, with a deviation of 0.003, at the same epoch. However, the mean AUC value of the test dataset begins to decline after epoch 26, while the mean AUC value of the validation dataset starts to show obvious signs of overfitting around epoch 100.

    We applied models T1 to T12 to both the validation and test datasets, generating their average ROC curves along with standard deviation intervals, as depicted in figure 2. The mean AUC value of the test dataset calculated from these 12 models is 0.932, with a deviation of 0.007. This value is slightly lower than the mean AUC value of 0.934 depicted in figure 1(b). Although the mean AUC values of the validation and test datasets achieve their maxima at the same epoch, in certain individual models the AUC value of the test dataset already exhibits clear signs of overfitting by the time the AUC value of the validation dataset achieves its maximum, as is evident in figure 1(a).

    Figure  2.  Average ROC curve of models T1 to T12 for the K-fold validation dataset and test dataset as well as their standard deviation intervals; the two balance points have been chosen by the F0.5 score of model T1. The red star marks the point with a FPR of 6.7% and a TPR of 82.2%, while the blue star marks the point with a FPR of 10% and a TPR of 82.6%.

    Additionally, the optimal balance points on the ROC curve for the validation and test datasets are marked specifically for model T1. In this model, the optimal threshold value of 0.62, which maximizes the F0.5 score to 0.904, resulted in the best performance on the validation dataset with a FPR of 6.7% and a TPR of 82.2%. However, on the test dataset, the model exhibited a slightly higher FPR of 10% while maintaining a similar TPR of 82.6%. The corresponding confusion matrix of the training dataset, validation dataset and test dataset is shown in table 4.

    Table 4. The confusion matrix of the training, validation and test datasets of model T1 with the classical annotation method (TP, true positive; FP, false positive; FN, false negative; TN, true negative).

    Threshold = 0.62 | Predicted disruption (1) | Predicted non-disruption (0)
    Training dataset, label 1 | 708 (TP) | 125 (FN)
    Training dataset, label 0 | 51 (FP) | 934 (TN)
    Validation dataset, label 1 | 152 (TP) | 33 (FN)
    Validation dataset, label 0 | 12 (FP) | 167 (TN)
    Test dataset, label 1 | 138 (TP) | 29 (FN)
    Test dataset, label 0 | 22 (FP) | 197 (TN)

    As discussed in the previous section, there is an urgent need to develop an accurate ground truth labeling method that can reflect the objective disruption risk. Although hard labels do not provide an objective representation of risk, the capacity of the neural network to learn the mapping from input features to hard labels enables it to generate output values that approximate the ground truth more closely than the hard labels themselves. In fact, the outputs of the neural network represent a more detailed and nuanced view of model uncertainty in the likelihood of a disruption occurring at a given time instant.

    Figure 3 shows an example of the prediction result for discharge #15600 on EXL-50 generated by model T1. The bottom panel of the figure displays the predicted likelihood or risk of a disruption, which increases rapidly from t = 3.62 s, coinciding with the increase in the radiation signal. This observation highlights a strong dependence of the output on radiation, indicating that the soft labels provide a more refined representation of disruption risk that takes the effects of signal features into account. This highlights the potential for using these outputs as ground truth labels to supervise further training. The soft outputs generated by the model during the second training are more representative of the ground truth values of disruption risk, and this approach can be viewed as a process of repeatedly distilling refined knowledge, leading to further improvements in the accuracy of the predictive model.

    Figure  3.  Example prediction on discharge #15600 from EXL-50. The top three panels show the normalized input signals, and the bottom panel shows the model output as a function of the input signals.

    In the field of neural networks, knowledge distillation [27] is a well-established technique used to transfer the knowledge contained within a larger, more complex model (referred to as the teacher model) to a smaller and simpler model (referred to as the student model). The output distributions of the teacher model provide insight into how the teacher model represents knowledge, and by training the student model to mimic these output distributions, the student model is able to learn from the teacher model’s internal representations and improve its own performance. This process of knowledge distillation has proven to be a highly effective method for transferring knowledge and improving the performance of neural networks.

    Inspired by the concept of knowledge distillation, an improved teacher–student training framework is proposed to deal with the issues of overfitting and overconfidence that arise from classical annotation methods. The new training framework consists of two distinct training processes for neural networks, as illustrated in figure 4. Specifically, both the teacher model and the student model utilize an MLP neural network with the identical architecture discussed in the previous section. During the first training process, hard labels are used as the target to train the teacher model, which generates soft label outputs. These soft labels reflect a more nuanced view of the disruption risk, assigning values to the risk based on the varied input features. Here we recall the cross-entropy loss function

    Figure 4. Schematic diagram of the new training framework. The soft labels of the teacher model are used to supervise the training of the student model, with corrections from the hard labels in the false positive and false negative cases. In the training of the student model, the cross-entropy loss is still used to adjust the model weights.
    $\mathrm{Entropy} = -\sum_{i=1}^{n} p(x_i)\log\,q(x_i)$. (8)

    When using soft labels, p is not restricted to 0 or 1 but represents the soft labels predicted by the teacher model, and q represents the probabilities predicted by the student model. During the second training, the soft labels are then used to supervise the student model, leading to more accurate predictions by incorporating more nuanced information about the probabilities.

    However, it is important to note that the teacher model may not always make correct predictions, and incorporating incorrect predictions may degrade the performance of the predictive model. To mitigate this issue, only the correct predictions from the teacher model are used to supervise the student model. As illustrated in equation (9), for true positive or true negative discharges the soft labels given by the teacher model are used as the new target. In contrast, for false positive or false negative discharges the hard labels given by classical labeling are still used as the target, since they provide a correct, although rough, distinction between the stable stage and the pre-disruptive stage. This new training framework with a correction mechanism allows the student model to learn from the correct distinctions made by the hard labels, while also incorporating the more nuanced information provided by the soft labels for true positive and true negative discharges

    $y_{\mathrm{new}}=\begin{cases}\mathrm{hard\ label}, & \mathrm{if\ FP\ or\ FN}\\ \mathrm{soft\ label}, & \mathrm{if\ TP\ or\ TN}.\end{cases}$ (9)
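
    A sketch of how the corrected targets of equation (9) and the soft-target cross-entropy of equation (8) could be assembled in PyTorch; the tensor shapes and names are illustrative assumptions, not the authors' code:

        import torch

        def build_student_targets(soft_labels, hard_labels, correct_shot):
            """Equation (9): soft labels for TP/TN shots, hard labels otherwise.

            soft_labels   teacher softmax outputs per sample, shape (N, 2)
            hard_labels   one-hot classical labels per sample, shape (N, 2)
            correct_shot  bool per sample, True if its discharge was a TP or TN
            """
            mask = correct_shot.unsqueeze(1)      # (N, 1), broadcast over classes
            return torch.where(mask, soft_labels, hard_labels)

        def soft_cross_entropy(logits, targets):
            """Equation (8) with continuous targets p: -sum(p * log q)."""
            log_q = torch.log_softmax(logits, dim=1)
            return -(targets * log_q).sum(dim=1).mean()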

    To evaluate the effectiveness of the proposed training framework, 12 revised student models were trained based on 12 teacher models T1 to T12. We take teacher model T1, whose confusion matrix is shown in table 4, as an illustrative example to show the training process of the corresponding revised student model. This teacher model correctly classified 708 positive discharges and 934 negative discharges, and the soft labels generated by the teacher model were used as the new target to supervise the student model. However, for the 125 positive discharges and 51 negative discharges misclassified by the teacher model, the original hard labels were still used as the target for these instances. These two types of targets were used together to supervise the training of the revised student model, and the second training process is illustrated in figure 5(a). The loss of the validation dataset decreased almost consistently along with the loss of training dataset until convergence, indicating that the disruptive risk of samples in the training dataset was accurately labeled. Furthermore, the AUC value of the validation dataset gradually increases to 0.95, while the AUC curves of the test dataset during training remain smooth and stable.

    Figure  5.  (a) Cross-entropy loss of the training and validation datasets, as well as the AUC value of the validation and test datasets versus training epochs of the revised student model which learned from the teacher model T1. (b) Average AUC value with the ±1 standard deviation interval amongst the six-fold validation dataset and the test dataset of 12 revised student models. The dashed line indicates epoch 149, at which the average AUC value of the validation dataset achieved its maximum value of 0.949 with a standard deviation of 0.012. Additionally, the average AUC value of the test dataset at this epoch is 0.94, with a deviation of 0.005.

    The average AUC values with standard deviation intervals of the 12 revised student models on both the validation and test datasets are shown in figure 5(b). The mean AUC values of both datasets exhibit minimal and stable fluctuations, resembling nearly horizontal lines after 50 epochs, which demonstrate that the proposed training framework effectively mitigates the issue of overfitting and enhances the performance of the predictive model. Notably, the variance in the test dataset is considerably reduced due to the fixed test dataset, whereas the validation dataset varies across the 12 models.

    To assess the performance difference between the teacher models with classical annotation methods and the revised student models, the mean AUC values as well as their standard deviation intervals versus the training epoch for both validation datasets and test datasets are plotted in figure 6. Additionally, 12 more student models that only utilize the soft labels without the correction mechanism were also trained based on the teacher models T1 to T12.

    Figure  6.  (a) The average AUC values with the ±1 standard deviation interval for the K-fold validation dataset versus training epochs in three different cases: teacher model, student model and revised student model. (b) The average AUC values with the ±1 standard deviation interval for the test dataset versus training epochs in three different cases. The dashed lines indicate the epochs at which the average AUC values of the validation dataset achieved their respective maximum for each case, and the AUC values for validation dataset (a) and test dataset (b) at these specific epochs are shown in the legends for each case.

    The results show that the AUC curves of the two student models on both the validation and test datasets are more stable and smoother than those of the teacher models, which tend to overfit during training. The dashed lines indicate the epochs at which the average AUC values of the validation dataset achieved their respective maxima for each case. The average AUC value of the student models without the correction mechanism on the test dataset at epoch 48 is 0.934 with a standard deviation of 0.006; this performance is no better than that of the teacher models. This may be because these student models learn not only the correct predictions but also the incorrect predictions of the teacher models. However, by incorporating corrections from hard labels for the incorrect predictions made by the teacher models, the AUC values of the revised student model are significantly higher than those of the other models on both the validation and test datasets.

    To make a comprehensive comparison, the test dataset was evaluated using the 12 teacher models and the 12 revised student models. Average ROC curves on the test dataset, along with standard deviation intervals, were generated based on the predictions from these teacher and revised student models. Figure 7 illustrates that the predictive performance of the revised model is significantly superior to that of the teacher model, displaying a higher true positive rate and lower false positive rate at almost every threshold. The mean AUC value of test dataset calculated from the revised student models is 0.94 with a deviation of 0.005, showcasing identical performance to the revised student case depicted in figure 6(b), where the mean AUC value of the revised student models for test dataset is also 0.94 with a deviation of 0.005 at epoch 149. Meanwhile, the mean AUC value of the test dataset calculated from 12 teacher models is 0.932, with a deviation of 0.007, slightly lower than the mean AUC value of 0.934 depicted in figure 6(b). This illustrates the stability of the training process for revised student models when compared with teacher models. The position of the red star on the red curve, with a true positive rate of 88.0% and a false positive rate of 9.6%, maximizes the F0.5 score of 0.88 for the revised student model which learns from the teacher model T1, indicating the optimal trade-off between recall and precision.

    Figure  7.  Average ROC curve with standard deviation intervals of the teacher and revised student models for the test dataset; the two balance points were chosen by the F0.5 score for teacher model T1 and the revised student model which learns from teacher model T1. The red star marks the point with a FPR of 9.6% and a TPR of 88.0%, while the blue star marks the point with a FPR of 10% and a TPR of 82.6%.

    These results demonstrated the effectiveness of the proposed new training framework in improving the predictive performance of the revised student model compared with other models, showing superior generalization capacity. In conclusion, the revised student model exhibits two advantages:

    (1) Improvement of predictive performance. The soft label provides a more detailed and nuanced view of model uncertainty in the likelihood of a disruption, which encourages the model to learn the underlying patterns in the data rather than fitting misclassified data points in the training dataset.

    (2) Selection of a model with better generalization capacity. The soft labels provide a smoother and more continuous target for the model to optimize towards, yielding more stable and smoother learning curves.

    The feasibility of a disruption predictor depends not only on the TPR and FPR but also on the advance warning timing. The advance warning time distributions of both the teacher and revised student models on the test dataset are shown in figure 8, where the fraction of detected disruption at time X represents the fraction of all disruptions detected in the test dataset at least X ms in advance. It is noteworthy that the majority of disruptions predicted by the teacher and revised student models are forewarned more than 30 ms in advance. There is no significant difference observed in the fraction distribution of detected disruptions between the revised student models and the teacher models.

    Figure 8. Average cumulative warning time distributions of the teacher model and the revised student model along with standard deviation intervals for the test dataset; the approximate warning time needed for mitigation (30 ms) is highlighted with a black dashed line.

    Another type of student model was trained to determine whether the performance improvement of the revised student model is primarily due to the label smoothing brought by the soft labels or to the capacity to identify the onset of the disruptive stage learned from the teacher model. This model was specifically designed to acquire the teacher model's capacity for identifying the transition from the stable to the pre-disruptive stage, using hard labels derived from teacher models T1 to T12 instead of their soft labels. That is to say, once the output generated by the teacher model at some time instant exceeds the threshold value, all subsequent time instants are labeled as 1, and the boundary between 0 and 1 indicates the transition from stability to a pre-disruptive plasma state. As with the revised student model, if a discharge is misclassified by the teacher model its target still uses the original hard labels.
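
    A sketch of this "sticky" hard-label construction for one discharge; the threshold value of 0.62 is taken from the model T1 example above, and the function name is illustrative:

        import numpy as np

        def teacher_hard_labels(teacher_output, threshold=0.62):
            """Hard labels from teacher outputs for one discharge.

            Once the teacher output exceeds the threshold at any instant,
            all subsequent instants are labeled 1, marking the learned
            transition from the stable to the pre-disruptive stage.
            """
            exceeded = np.asarray(teacher_output) > threshold
            return (np.cumsum(exceeded) > 0).astype(int)  # sticky 0 -> 1 switch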

    Figure 9(a) illustrates the training process of the student model that solely learns from the hard labels of teacher models. The loss of the validation dataset decreased until convergence, almost consistently along with the loss of training dataset, indicating that the samples in the training dataset were nearly correctly classified. However, the trends of AUC values on the validation dataset and test datasets exhibit inconsistencies, and they failed to reach the maximum performance at the same epoch. The mean AUC values with standard deviation intervals are also plotted in figure 9(b), which shows high variance and strong fluctuations in the AUC value. The average AUC value of 0.915 at epoch 9 on the test dataset is far below that of the revised student model, demonstrating that soft labels play a significant role in enhancing predictive performance.

    Figure  9.  (a) Cross-entropy loss of the training and validation datasets, as well as the AUC value of the validation and test datasets versus training epochs of the student model which only learns from hard labels of teacher model T1. (b) Average AUC value with ±1 standard deviation intervals for the six-fold validation dataset and test dataset of the 12 student models which only learn from hard labels of teacher models T1 to T12. The dashed line indicates epoch 9, at which the average AUC value of the validation dataset achieved its maximum value of 0.938 with a standard deviation of 0.012. Additionally, the average AUC value of the test dataset at this epoch is 0.915, with a deviation of 0.008.

    Compared with the student model supervised by soft labels, the hard labels ‘0’ and ‘1’ encourage the largest possible logit gaps at the penultimate layer to be fed into the softmax function. Intuitively, these large logit gaps combined with the bounded gradient will make the model less adaptive and too confident in its predictions. By using soft labels, the student model is encouraged to have softer decision boundaries and becomes less certain about its predictions, resulting in a more balanced and stable training process. Additionally, the correction mechanism also plays a crucial role in enhancing the performance of the revised student model, allowing it to exhibit superior generalization capacity. In summary, the performance improvements of the revised student model benefit from the coupling effect of soft labels and the correction mechanism.

    Accurate and reliable ground truth labeling is a fundamental requirement for training well-performing predictive models. However, the classical annotation method neglects the significant impact of various plasma parameters on disruption risk, and subjectively assumes that the value of risk at a given moment depends solely on its proximity to disruption. The predictive results obtained using classical annotation methods on EXL-50 exhibit the problems of overfitting and overconfidence, as the model learns incorrect patterns present in the inaccurately labeled data. Therefore, it is necessary to develop an accurate ground truth labeling method that can reflect the objective risk of disruption.

    In this paper, we proposed a new training framework to overcome the limitations of the classical annotation method for disruption prediction. Inspired by the concept of knowledge distillation in deep learning, this improved framework utilizes the knowledge-rich predictions generated by a teacher model to supervise the training of a student model. Moreover, in this new training framework, the soft labels produced by the teacher model, in conjunction with the hard labels assigned to misclassified instances, provide more accurate and reliable ground truth labeling than the classical annotation method.

    The predictive results obtained using this newly proposed framework on the same test dataset demonstrated significant improvements in predictive performance compared with the classical annotation method. This was evident in the higher AUC value achieved for the test dataset, as well as the reduced fluctuations in the AUC values of the test dataset. These results strongly indicate that this new framework can significantly increase the generalization ability of the predictive model, and effectively address the issues of overfitting and overconfidence that often arise from classical annotation methods. Furthermore, this new training framework provides a possible method to deal with more complex labeling across different machines.

    In disruption prediction for future reactors like ITER, the disruptive discharges from such reactors themselves will be insufficient for training, since these devices cannot tolerate the effects of disruptions. To overcome the insufficient disruptive class of ITER, one possible approach is to learn the disruptive patterns from existing tokamaks and transfer the knowledge to the ITER case using a limited number of disruptive discharges. Cross-device training has been explored in several works [22–24, 31]. However, labeling across devices is much more challenging due to the varied disruption triggers and device conditions. The classical annotation method applied across devices can exacerbate the problems of overfitting and overconfidence, especially when a simple step function is used to represent disruption risk across different devices with varying conditions.

    To overcome the problems of overfitting and overconfidence that arise from classical labeling across different machines, a technique called cross-machine label smoothing (CMLS) was proposed in a previous paper [23] to take the uncertainty into account by manually modifying the target values with smoothing hyper-parameters. Label smoothing has been demonstrated to prevent overconfident predictions by encouraging small logit gaps and tolerance of uncertainty [32]. However, one limitation of CMLS is that the label smoothing parameters are not specific to individual instances, but are shared among all discharges within the same machine.

    In contrast, the improved training framework proposed in this paper provides an instance-specific label smoothing method that assigns values to disruption risk based on the patterns learned by the model. Moreover, it has been demonstrated that the improvement in model performance benefits from the coupling effect of the soft labels and the correction mechanism. The soft labels trained across machines reflect a more detailed and nuanced view of model uncertainty in the likelihood of a disruption, taking into account the varied signal features and machine conditions. Overall, this improved training framework offers a promising solution to effectively address the challenges of overfitting and overconfidence, and might have the potential to deal with more complex labeling across different machines.

    This work was supported by National Natural Science Foundation of China (Nos. 12175277 and 11975271), and the National Key R&D Program of China (No. 2022YFE03050003).

    This work was supported by the Key Projects of School-enterprise Joint Fund (No. U22B20120) and the National Science Fund for Distinguished Young Scholars (No. 52107141).

    [1] Serjeant S, Elvis M and Tinetti G 2020 Nat. Astron. 4 1031 doi: 10.1038/s41550-020-1201-5
    [2] Watanabe H, Cho S and Kubota K 2020 Acta Astronaut. 166 227 doi: 10.1016/j.actaastro.2019.07.042
    [3] Venkatesan A et al 2020 Nat. Astron. 4 1043 doi: 10.1038/s41550-020-01238-3
    [4] McDowell J C 2020 Astrophys. J. Lett. 892 L36 doi: 10.3847/2041-8213/ab8016
    [5] Conversano R W et al 2017 J. Propul. Power 33 975 doi: 10.2514/1.B36230
    [6] Ning Z X et al 2019 Plasma Sci. Technol. 21 125402 doi: 10.1088/2058-6272/ab4364
    [7] Meng T H, Ning Z X and Yu D R 2020 Plasma Sci. Technol. 22 094001 doi: 10.1088/2058-6272/ab7902
    [8] Iqbal M et al 2004 Vacuum 77 19 doi: 10.1016/j.vacuum.2004.07.066
    [9] Xu J P et al 2016 Vacuum 134 83 doi: 10.1016/j.vacuum.2016.09.025
    [10] Garulli A et al 2011 J. Guid. Control Dynam. 34 1683 doi: 10.2514/1.52985
    [11] Olivieri L and Francesconi A 2020 Adv. Space Res. 65 351 doi: 10.1016/j.asr.2019.09.048
    [12] Gushenets V I et al 1999 IEEE Trans. Plasma Sci. 27 1055 doi: 10.1109/27.782281
    [13] Domonkos M T, Gallimore A D and Patterson M J 1997 An evaluation of hollow cathode scaling to very low power and flow rate Proc. of the 25th Int. Electric Propulsion Conf. Cleveland: IEPC 1997
    [14] Wirz R et al 2006 Discharge hollow cathode and extraction grid analysis for the MiXI ion thruster Proc. of the 42nd AIAA/ASME/SAE/ASEE Joint Propulsion Conf. & Exhibit Sacramento: AIAA 2006: 2006-4498
    [15] Samples S A and Wirz R E 2020 Plasma Res. Express 2 025008 doi: 10.1088/2516-1067/ab906d
    [16] Li F et al 2023 Vacuum 207 111492 doi: 10.1016/j.vacuum.2022.111492
    [17] Goebel D M et al 2005 J. Appl. Phys. 98 113302 doi: 10.1063/1.2135417
    [18] King S et al 2012 Small satellite LEO maneuvers with low-power electric propulsion Proc. of the 44th AIAA/ASME/SAE/ASEE Joint Propulsion Conf. & Exhibit Hartford: AIAA 2012: 2008-4516
    [19] Leomanni M et al 2017 Acta Astronaut. 133 444 doi: 10.1016/j.actaastro.2016.11.001
    [20] Pedrini D et al 2020 Aerospace 7 96 doi: 10.3390/aerospace7070096
    [21] Lev D et al 2019 Rev. Sci. Instrum. 90 113303 doi: 10.1063/1.5097599
    [22] Warner D J, Branam R D and Hargus W A 2010 J. Propul. Power 26 130 doi: 10.2514/1.41386
    [23] Conversano R W et al 2022 Acta Astronaut. 197 53 doi: 10.1016/j.actaastro.2022.05.015
    [24] Becatti G, Conversano R W and Goebel D M 2021 Acta Astronaut. 178 181 doi: 10.1016/j.actaastro.2020.09.013
    [25] Pedrini D et al 2018 IEEE Trans. Plasma Sci. 46 296 doi: 10.1109/TPS.2017.2778317
    [26] Vekselman V et al 2013 J. Propul. Power 29 475 doi: 10.2514/1.B34628
    [27] Rubin B and Williams J D 2008 J. Appl. Phys. 104 053302 doi: 10.1063/1.2973690
    [28] Tighe W et al 2005 Performance evaluation and life test of the XIPS hollow cathode heater Proc. of the 41st AIAA/ASME/SAE/ASEE Joint Propulsion Conf. & Exhibit Tucson: AIAA 2005: 2005-4066
    [29] Ning Z X et al 2018 Vacuum 155 470 doi: 10.1016/j.vacuum.2018.06.054
    [30] Gallimore A D, Rovey J L and Herman D A 2007 J. Propul. Power 23 1271 doi: 10.2514/1.27897
    [31] Polk J et al 2008 Ongoing wear test of a XIPS 25-cm thruster discharge cathode Proc. of the 44th AIAA/ASME/SAE/ASEE Joint Propulsion Conf. & Exhibit Hartford: AIAA 2008: 2008-4913
    [32] William G T and Chien K R 2005 Hollow cathode ignition and life model Proc. of the 41st AIAA/ASME/SAE/ASEE Joint Propulsion Conf. & Exhibit Tucson: AIAA 2005: 2005-3666
    [33] Katz I et al 2008 IEEE Trans. Plasma Sci. 36 2199 doi: 10.1109/TPS.2008.2004363
    [34] Goebel D M et al 2023 Fundamentals of Electric Propulsion 2nd ed (New York: Wiley)
    [35] Wang F et al 2022 J. Phys. D: Appl. Phys. 55 455202 doi: 10.1088/1361-6463/ac90ce