Remote photoplethysmography for contactless pulse rate monitoring: algorithm development and accuracy assessment

0967-3334/46/11/115004

Objective. Remote photoplethysmography (rPPG) offers a promising method for contactless pulse rate (PR) monitoring, which is particularly valuable for remote patient care. However, signal noise, caused by factors such as motion and lighting, can significantly impact measurement accuracy. Approach. We present a hybrid algorithm that combines frequency-domain analysis to estimate an initial PR with a time-domain approach to refine this estimate, improving robustness in challenging conditions. Main results. The combined method increases accuracy and success rate compared to the time-domain method alone. Evaluated against time-aligned electrocardiogram, it achieved a root mean square error (RMSE) as low as 2.0 bpm and an $r^2$ of 0.96. On a larger outpatient dataset, the RMSE was 3.2 bpm with an $r^2$ of 0.93. Importantly, no significant difference in accuracy was observed across skin tones. Significance. These results demonstrate that the proposed PR algorithm enables reliable, contactless pulse monitoring in real-world conditions, supporting broader adoption of rPPG for inclusive and scalable remote health monitoring.


Pulse rate (PR) is a critical vital sign widely used by healthcare providers for patient triage (Knaus 1992, Farrohknia et al 2011). Typically, PR is measured using a pulse oximeter, which utilises photoplethysmography (PPG) to detect changes in blood volume.

PPG signals can be obtained from various bare skin sites, but in adults, pulse oximeters are most commonly applied as finger clips. Alternatively, wearable devices such as smartwatches measure PR by capturing PPG signals at the wrist. While these methods are widely adopted, they have limitations. Both pulse oximeters and smartwatches rely on dedicated hardware, which may not always be available, particularly in remote care settings. Additionally, these devices require regular cleaning when used with multiple patients to prevent cross-contamination, a concern that became particularly prominent during the COVID-19 pandemic (Birlie et al 2021).

In 2008, Verkruysse et al demonstrated that PPG signals can also be captured remotely using standard consumer-grade cameras (2008). Remote PPG (rPPG) can use widely available devices such as smartphones and webcams, enabling patients to measure vital signs independently. In contrast to PPG, rPPG measures the amount of reflected light instead of absorbed and most commonly uses green light (approximately 550 nm), due to its peak haemoglobin absorption (Van Gastel et al 2016).

Despite its potential for enhancing accessibility and enabling remote healthcare, rPPG faces several technical and practical challenges that must be addressed for widespread adoption. The main challenge is to capture a clean, high-quality signal. Variation in environmental lighting can cause noise, because rPPG relies on reflected light (Lu et al 2020), and motion from the patient (or device) generates noise due to the relative movement of the selected area for signal capture (Rohmetra et al 2021). Overcoming this additional noise requires enhanced filtering and signal processing methods that retain the pulsatile component and reject the added noise components (Waugh et al 2018). Capturing rPPG signals can also be more challenging in darker skin tones, due to the increased light absorption by higher levels of melanin in the skin. Contact-based PPG methods are also known to suffer from accuracy problems in darker skin tones, specifically when measuring SpO2 (Jamali et al 2022). Given the added challenges of remote monitoring, it is important to test the accuracy of rPPG vital sign measurement methods across different skin tones.

Different approaches have been used to extract the PR from rPPG signals, regardless of which skin area the signal was captured from. Commonly used approaches are machine learning models, such as convolutional attention networks or convolutional neural nets (Chen and McDuff 2018, Niu et al 2019). Deep learning methods are commonly tested on publicly available datasets such as COHFACE (Heusch et al 2017) or MAHNOB-HCI (Soleymani et al 2011). These databases contain a diverse set of participants, varying in age, sex, skin tone and spanning a wide range of PRs. A comprehensive review on PR measurement using rPPG and deep learning by Debnath and Kim (2025) shows that deep learning models can achieve mean absolute errors between 2.4 and 6.9 beats per minute on these datasets. While these models can be effective, they are computationally expensive and require a large dataset for model training. Moreover, their lack of interpretability, sometimes referred to as a ‘black-box’, makes it difficult to understand the underlying decision process. A more transparent, signal-processing-based alternative to extract PR from rPPG signals would be to use frequency domain methods. For example, Poh et al demonstrated they could use a fast Fourier transform (FFT) to identify the power spectrum and search for a peak in the range of frequencies where the PR would be expected (between 45 and 240 beats per minute) (Poh et al 2010). They also demonstrated that blind source separation, and specifically independent component analysis (ICA) improved their accuracy. ICA has been recommended as one of the techniques suitable to reduce noise artefacts in (r)PPG signals (Kim and Yoo 2006, Yao and Warren 2006), by separating the signal into statistically independent components and assuming the (unwanted) noise and (wanted) pulsatile signal are not correlated. 
Frequency-based methods are computationally efficient, but their accuracy is limited by a trade-off between frequency resolution and time window size, particularly when data are acquired at the relatively low sampling rates of consumer-grade cameras in smart devices, which are typically limited to 30 or 60 frames per second. In addition, they are not suitable for calculating the variation in PR, although heart rate variability (HRV) is an important indicator of cardiac health. Spatio-temporal methods, such as identifying the individual peaks and pulse intervals in the time domain, can be used to measure both PR and HRV, but are harder to apply to noisy signals such as rPPG (Hu et al 2024). A previously described peak counting algorithm for PR calculation (van Putten et al 2024) demonstrated good accuracy, but relied on a rough seed value to work. In this manuscript, we present an improved PR algorithm based on a combination of frequency domain methods and spatio-temporal peak counting. We validated this algorithm on two datasets: a larger dataset collected in hospital and a smaller dataset collected in a clinical laboratory, where simultaneous electrocardiogram (ECG) recordings were captured.
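As a brief illustration of the resolution trade-off mentioned above (the window durations are illustrative and not taken from the study), the spacing between the bins of a discrete Fourier transform computed over $N$ samples at sampling rate $f_s$, i.e. over a window of duration $T_w = N/f_s$, is

$$\Delta f = \frac{f_s}{N} = \frac{1}{T_w}, \qquad \Delta \mathrm{PR} = \frac{60}{T_w}\ \mathrm{bpm}.$$

A 10 s window therefore resolves PR in steps of 6 bpm and a 30 s window in steps of 2 bpm. A finer grid requires a longer window, which in turn responds more slowly to changes in PR.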

2.1. Signal capture

To capture rPPG signals, the video should contain enough skin area with sufficient blood flow to capture the pulsatile signal. Due to its small amplitude, this signal is extremely susceptible to changes in ambient light, facial movement and other external noise. To avoid disturbances from these external factors as much as possible, the mid-face, containing the cheeks and nose as shown in figure 1, is used (Lempe et al 2013). To detect the landmarks, a pre-trained facial landmark detector from the dlib library is used to estimate the location of 68 facial landmarks (King 2009, Kazemi and Sullivan 2014). The mid-face region was constructed as two vertically stacked rectangles covering the area from just below the eyes to the base of the nose. The top rectangle spanned a wider horizontal range, from each outer eye corner to a point midway toward the corresponding facial boundary. The bottom rectangle was narrower, spanning only between the outer corners of the eyes. This design avoids regions at the facial edges that are more prone to motion artefacts, and can be computed efficiently for real-time analysis. It also contains the main branches of the infra-orbital artery below the eye socket and is therefore believed to have a strong pulsatile signal (Hufschmidt et al 2019). In addition to capturing enough skin area, the sampling rate of the video needs to be sufficient to detect individual heart beats. The data for this study were captured using multiple iPad 8s with a frame rate of 30 frames per second.

Figure 1. Detected facial landmarks shown with the derived mid-face region of interest highlighted in black.


For each frame in the video, the light intensity of each colour channel (red, green, blue [RGB]) is spatially averaged across the mid-face region. This creates a one-dimensional array per colour channel spanning the whole video duration, where each value represents the average intensity of all pixels in the mid-face region at one timestamp. These three one-dimensional arrays are much smaller than the video, enabling fast upload of the rPPG signal over mobile networks, and contain no personally identifiable images, which addresses data protection concerns.
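The spatial averaging step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `mean_rgb_trace`, the array layout and the boolean `mask` describing the mid-face region are all assumptions made for the example.

```python
import numpy as np

def mean_rgb_trace(frames: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Spatially average each colour channel inside a region of interest.

    frames: (T, H, W, 3) array of RGB video frames (assumed layout).
    mask:   (H, W) boolean array, True inside the mid-face region.
    Returns a (T, 3) array holding one mean R, G, B value per frame.
    """
    if not mask.any():
        raise ValueError("mask selects no pixels")
    # Boolean indexing over the two spatial axes keeps only the ROI pixels,
    # giving an array of shape (T, n_roi_pixels, 3).
    roi_pixels = frames[:, mask, :]
    return roi_pixels.mean(axis=1)

# Synthetic usage: 5 frames of 4x4 pixels with a brighter green patch in the ROI.
frames = np.zeros((5, 4, 4, 3))
frames[..., 1] = 100.0
frames[:, 1:3, 1:3, 1] = 120.0
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
trace = mean_rgb_trace(frames, mask)  # shape (5, 3)
```

Only the `(T, 3)` trace would need to leave the device, which is what makes the upload small and free of identifiable imagery.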

2.2. Signal cleaning

The first step before the PR can be calculated is signal cleaning, especially for rPPG signals, which are inherently noisier than contact PPG signals. Multiple rPPG extraction options are available, including using only the green channel, chrominance (CHROM) combinations, and principal component analysis (PCA) or independent component analysis (ICA) based projections (Khaleel Sallam Ma’aitah and Helwan 2025, Shahmirzadi et al 2025). For this work, the plane orthogonal to skin (POS) method of Wang et al (2016) was selected, because it operates on temporally normalised RGB with a fixed linear projection, requiring no model training or extensive hyperparameter tuning. In practice, it is less sensitive to global intensity fluctuations and moderate motion than the green channel alone, resulting in cleaner signals and derived frequency spectra. The POS signal is calculated using the averaged RGB signals from each frame. The signals are combined into a single matrix:

$$\mathbf{C} = \begin{bmatrix} R_1 & R_2 & \cdots & R_T \\ G_1 & G_2 & \cdots & G_T \\ B_1 & B_2 & \cdots & B_T \end{bmatrix}$$

where $R_t$, $G_t$, and $B_t$ represent the mean values of the red, green, and blue channels at time $t$, and $T$ is the total number of frames. For each of the colour channels, we apply temporal averaging using a sliding window of length $l$:

$$\mu_R(t) = \frac{1}{l}\sum_{i \in W_t} R_i, \qquad \mu_G(t) = \frac{1}{l}\sum_{i \in W_t} G_i, \qquad \mu_B(t) = \frac{1}{l}\sum_{i \in W_t} B_i$$

where $W_t$ denotes the $l$ frames in the window at time $t$. Using the temporal sliding averages, a mean diagonal matrix is constructed:

$$\mathbf{M}(t) = \mathrm{diag}\left(\mu_R(t),\, \mu_G(t),\, \mu_B(t)\right)$$

which can be used to normalise the input matrix:

$$\mathbf{C}_n(t) = \mathbf{M}(t)^{-1}\,\mathbf{C}(t) = \begin{bmatrix} R_n(t) \\ G_n(t) \\ B_n(t) \end{bmatrix}$$

These three channels are mapped using a projection matrix, defined as:

$$\mathbf{P} = \begin{bmatrix} 0 & 1 & -1 \\ -2 & 1 & 1 \end{bmatrix}$$

When the projection matrix is applied, this results in a transformed 2D signal:

$$\mathbf{S}(t) = \mathbf{P}\,\mathbf{C}_n(t)$$

This results in a matrix which can be described as two 1D signals:

$$S_1(t) = G_n(t) - B_n(t), \qquad S_2(t) = -2R_n(t) + G_n(t) + B_n(t)$$

where $R_n(t)$, $G_n(t)$ and $B_n(t)$ are the normalised red, green and blue channels respectively. These two 1D signals can be combined using a tuning factor, $\alpha$,

$$\alpha = \frac{\sigma_{S_1}}{\sigma_{S_2}}$$

where $\sigma_{S_1}$ and $\sigma_{S_2}$ are the standard deviations (SDs) of the projected signals $S_1$ and $S_2$. The tuning factor, $\alpha$, can be used to calculate the final 1D signal that is used for PR extraction:

$$\mathrm{POS}(t) = S_1(t) + \alpha\, S_2(t)$$

This method is particularly effective in extracting the pulsatile rPPG signal from facial videos, because it leverages spatial redundancy in colour signals to overcome motion-induced noise. The pulsatile components from $S_1(t)$ and $S_2(t)$ are added, whereas specular reflections are cancelled. This approach works on the assumption that the pulsatile signal is consistent, while noise is not spatially correlated (Wang et al 2014). As a result, the POS signal has a higher signal-to-noise ratio compared to the individual colour channels, and can extract the pulsatile signal from RGB signals even when the lighting varies or the pulsatile signal is weaker, for example in those with lower pulse pressure or darker skin tone.
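A minimal sketch of the POS projection is given below. It follows the windowed, overlap-added formulation of Wang et al (2016) and may differ in detail from the pipeline used in this study; the function name `pos_signal`, the 1.6 s window and the synthetic channel gains are assumptions chosen for illustration, and camera data are assumed to be strictly positive so that the normalisation is defined.

```python
import numpy as np

def pos_signal(rgb, fps=30.0, window_s=1.6):
    """Plane-orthogonal-to-skin projection with overlap-adding.

    rgb: (T, 3) array of spatially averaged R, G, B values, one row per frame.
    window_s: sliding-window length in seconds (assumed value).
    """
    rgb = np.asarray(rgb, dtype=float)
    n_frames = rgb.shape[0]
    l = int(round(window_s * fps))
    if n_frames < l:
        raise ValueError("signal is shorter than one window")
    projection = np.array([[0.0, 1.0, -1.0],
                           [-2.0, 1.0, 1.0]])
    pos = np.zeros(n_frames)
    for start in range(n_frames - l + 1):
        window = rgb[start:start + l]
        # Temporal normalisation: divide each channel by its window mean.
        normalised = window / window.mean(axis=0)
        s1, s2 = projection @ normalised.T
        sigma_2 = s2.std()
        alpha = s1.std() / sigma_2 if sigma_2 > 0 else 0.0
        combined = s1 + alpha * s2
        # Overlap-add the zero-mean window output.
        pos[start:start + l] += combined - combined.mean()
    return pos

def dominant_hz(x, fps):
    """Frequency of the largest FFT magnitude after mean removal."""
    spectrum = np.abs(np.fft.rfft(x - np.mean(x)))
    return np.fft.rfftfreq(len(x), d=1.0 / fps)[np.argmax(spectrum)]

# Synthetic test: a 1.2 Hz pulse hidden under a stronger 0.3 Hz lighting drift.
fps = 30.0
t = np.arange(300) / fps
pulse = 0.01 * np.sin(2 * np.pi * 1.2 * t)
lighting = 0.02 * np.sin(2 * np.pi * 0.3 * t)
rgb = np.stack([100 * (1 + lighting) * (1 + 0.33 * pulse),
                90 * (1 + lighting) * (1 + 0.77 * pulse),
                80 * (1 + lighting) * (1 + 0.53 * pulse)], axis=1)
pos_peak_hz = dominant_hz(pos_signal(rgb, fps), fps)
green_peak_hz = dominant_hz(rgb[:, 1], fps)
```

In this synthetic case the green channel alone is dominated by the lighting drift, whereas the POS output is dominated by the pulse, because an intensity change common to all three channels cancels in both rows of the projection matrix.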

2.3. Algorithm description

2.3.1. Time based PR estimation

To calculate the PR we use the first derivative of the time-domain signal. Peralta et al found that the apex in the first derivative was one of the more suitable reference locations in PPG signals to study HRV, and therefore also to identify individual pulses (Peralta et al 2019). The peaks in the first derivative correspond to the point in the systolic upstroke where the speed of the flow reaches its maximum. We applied a 4th-order band-pass filter (0.125–4 Hz) to the time-domain signal, followed by a smoothed first derivative computed with a 5-point window for peak detection. Peaks are located by finding local maxima; however, owing to the noisy nature of rPPG, peaks do not necessarily correspond to pulses, even after filtering. To determine whether a peak in the first derivative is a fiducial point, quality checks are applied to both the amplitude of the peak and the duration of the pulse between peaks. To verify whether the amplitude of the peak indicates a fiducial point, an adaptive decaying threshold is used: for each additional frame in which no peak is found, the amplitude threshold that a local maximum must exceed to be accepted as a fiducial decays exponentially. Detected peaks are only recognised as fiducial points if their amplitude exceeds this dynamic threshold. Figure 2 shows the first derivative with the decaying threshold. It can be seen that peaks below the threshold are not recognised as fiducial points, and that the threshold resets at each accepted peak, based on the amplitude of that peak. Small peaks in the diastolic phase are therefore discarded as fiducials. In addition to this amplitude threshold, the number of frames between the detected maxima is used to determine whether the pulse interval should be considered valid. Because this interval check compares each candidate interval against the expected pulse duration, the interval finding algorithm needs an initial guess for the PR.
As described in van Putten et al (2024), this can be done using a rough estimate based on an initial peak count, where the PRs were categorised in three categories: low ( $\lt$ 64 bpm), normal (64–84 bpm) and high ( $\gt$ 84 bpm). However, it was found that this could cause problems for very low and very high PRs, where the true pulse interval was never considered a valid interval and fiducial location by the algorithm. To overcome this, an improved method to seed the interval finding algorithm is required.
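The peak detection logic described above can be sketched as follows. This is an illustrative reading of the description, not the published algorithm: the function name `detect_fiducials` and the parameters `decay`, `reset_fraction` and `tolerance` are hypothetical, since the decay rate, reset rule and interval limits are not stated in the text.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def detect_fiducials(signal, fps=30.0, seed_bpm=60.0,
                     decay=0.97, reset_fraction=0.6, tolerance=0.4):
    """Find first-derivative peaks using an exponentially decaying threshold.

    decay, reset_fraction and tolerance are hypothetical tuning values.
    """
    # Band-pass 0.125-4 Hz. In SciPy the order argument applies per band edge,
    # so the exact correspondence to a "4th-order" filter is an assumption.
    sos = butter(4, [0.125, 4.0], btype="bandpass", fs=fps, output="sos")
    filtered = sosfiltfilt(sos, np.asarray(signal, dtype=float))
    # First derivative smoothed with a 5-point moving average.
    deriv = np.convolve(np.gradient(filtered), np.ones(5) / 5, mode="same")
    # Shortest interval (in frames) accepted between pulses, set by the seed PR.
    min_gap = (1.0 - tolerance) * 60.0 / seed_bpm * fps
    threshold = reset_fraction * deriv.max()
    fiducials = []
    for i in range(1, len(deriv) - 1):
        is_local_max = deriv[i - 1] < deriv[i] >= deriv[i + 1]
        interval_ok = not fiducials or (i - fiducials[-1]) >= min_gap
        if is_local_max and deriv[i] > threshold and interval_ok:
            fiducials.append(i)
            # Reset the threshold relative to the accepted peak amplitude.
            threshold = reset_fraction * deriv[i]
        else:
            # No accepted peak on this frame: let the threshold decay.
            threshold *= decay
    return np.array(fiducials), deriv

# Synthetic usage: a 72 bpm sinusoid with a little noise, seeded at 70 bpm.
fps = 30.0
t = np.arange(300) / fps
rng = np.random.default_rng(0)
ppg = np.sin(2 * np.pi * 1.2 * t) + 0.02 * rng.standard_normal(t.size)
fiducials, _ = detect_fiducials(ppg, fps=fps, seed_bpm=70.0)
mean_interval_s = np.mean(np.diff(fiducials)) / fps
pr_bpm = 60.0 / mean_interval_s
```

The sketch only enforces a minimum interval; a fuller version would also reject intervals that are too long. It does show why a poor seed is harmful: if `seed_bpm` is far below the true rate, `min_gap` exceeds the true pulse interval and genuine peaks are rejected.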

Figure 2. Section of rPPG signal demonstrating the peak detection using an adaptive decaying threshold. Identified local maxima are only considered true fiducial points when the peak in the first derivative is above the adaptive decaying threshold.

2.3.2. Frequency based PR estimation

Instead of using a time-domain based method to calculate the seed value for the interval finding algorithm, a frequency domain method is used. As mentioned before, frequency domain methods are not suitable for calculating HRV metrics and can also suffer from limited resolution, especially when the signal is captured at lower sampling rates. For the purpose of seeding the interval finding algorithm this does not limit the combined algorithm, because the resolution is much finer than the 3-category approach previously used. To achieve a noise-robust power spectral density (PSD) estimate, Welch’s method is used (Solomon 1991). Welch’s method reduces the variance of the PSD estimate by dividing the signal into overlapping segments, applying a window function to each segment to reduce edge effects and spectral leakage, computing the FFT of each windowed segment, and averaging the squared magnitudes of the FFTs. This averaging makes the estimate more robust to noise than a single Fourier transform of the entire signal, at the cost of a coarser frequency resolution. In the PSD diagram, the PR can be found by locating the frequency of the highest power peak. This method was successfully used for PR measurements in paediatrics (Wegerif et al 2024), where time domain methods are more challenging due to the increased noise associated with movement in children. In figure 3 the PSD diagram for a single measurement is shown for the different colour channels and for the POS signal. This measurement was taken on a 70 year old male volunteer with a measured PR of 62 bpm. It can be seen that the peak location in the PSD spectrum for the POS signal is close to the measured PR; the peak is found at 0.94 Hz, around 56 bpm.
Figure 3 also illustrates that the resolution of the Welch method alone is not sufficient to calculate the PR; however, when this seed value is used in the time domain interval finding algorithm, an average interval duration of 976 ms is found, resulting in a final PR of 61 bpm, closely matching the ground truth value.
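The seeding step can be sketched with `scipy.signal.welch`. This is a hedged example rather than the study's configuration: the function name `welch_seed_bpm`, the 256-sample segment (SciPy's default, used here only for illustration) and the 45 to 240 bpm search band (borrowed from the range quoted for Poh et al in the introduction) are assumptions. At 30 frames per second, 256-sample segments give bins 30/256 ≈ 0.117 Hz apart, or about 7 bpm.

```python
import numpy as np
from scipy.signal import welch

def welch_seed_bpm(signal, fps=30.0, nperseg=256, band_bpm=(45.0, 240.0)):
    """Return the PR (bpm) at the highest in-band peak of the Welch PSD."""
    nperseg = min(nperseg, len(signal))
    freqs, psd = welch(signal, fs=fps, window="hann",
                       nperseg=nperseg, noverlap=nperseg // 2)
    bpm = freqs * 60.0
    in_band = (bpm >= band_bpm[0]) & (bpm <= band_bpm[1])
    # Ignore bins outside the plausible PR range when searching for the peak.
    peak_index = int(np.argmax(np.where(in_band, psd, -np.inf)))
    return bpm[peak_index], freqs, psd, peak_index

# Synthetic usage: 30 s of a noisy 60 bpm signal sampled at 30 frames per second.
fps = 30.0
t = np.arange(900) / fps
rng = np.random.default_rng(1)
pos_like = np.sin(2 * np.pi * 1.0 * t) + 0.5 * rng.standard_normal(t.size)
seed_bpm, freqs, psd, peak_index = welch_seed_bpm(pos_like, fps=fps)
bin_width_bpm = (freqs[1] - freqs[0]) * 60.0
```

The seed lands on the nearest PSD bin rather than on the true rate, which is why it is handed to the time-domain interval finder for refinement instead of being reported directly.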

Figure 3. Welch power spectral density diagram for POS and RGB signals for a measurement on a 70 year old, male volunteer with a measured PR of 62 bpm.


While the resolution of Welch’s method alone is not fine enough for a PR algorithm in clinical use, this approach results in a more precise seed value for the interval finder, overcoming the problem for low and high PRs that previously led to wrong or rejected readings. Figure 4 shows an overview of the proposed PR extraction process. The calculated PR is only considered valid when the signal quality is sufficiently high.

Figure 4. Overview of the steps for PR calculation with differences highlighted between the time domain and time and frequency domain combined method. The elliptical shapes highlight the differences between the algorithms, with the time domain method visualised in grey ellipses, and the combined time and frequency domain method visualised in white ellipses.

2.3.3. Signal quality assessment

Welch’s method not only allows for a more accurate seed value for the interval finding algorithm but also provides information on the signal quality. One signal quality metric is the signal-to-noise ratio (SNR), which in this case is calculated as the ratio between the power in the PR frequency bin and the power in the rest of the frequency spectrum. If this ratio is too low, the signal contains more noise than pulsatile signal and it may not be possible to extract an accurate PR. However, this is not the only signal quality criterion used to decide whether a measurement is of sufficient quality to calculate the PR. The time domain only PR algorithm used the variability in pulse length as a signal quality metric, as described in van Putten et al (2024). In summary, the pulse quality was assessed by taking the average of all pulse intervals found, τ, and the SD of all intervals, σ. The signal quality index (SQI) was then calculated using:

While this worked well to an extent, problems can occur when few pulses are found in a noisy signal. The SD needs a larger number of intervals to give a true indication of the variability in pulse length, and as a result noisy signals with fewer pulses could receive falsely high signal quality indices. This inspired a new signal quality metric: the ratio of observed to expected pulses. Once the PR (in bpm) is calculated, the predicted number of pulses, $N_\mathrm{pred}$, can be calculated based on the measurement length, $L$, in seconds:

$$N_\mathrm{pred} = \frac{\mathrm{PR} \cdot L}{60}$$

With the predicted number of pulses, $N_\mathrm{pred}$, and the number of found pulses, $N_\mathrm{found}$, the pulse yield (PY) can be calculated and used to identify what fraction of the signal contains useful pulsatile information:

$$\mathrm{PY} = \frac{N_\mathrm{found}}{N_\mathrm{pred}}$$
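The spectral SNR and the pulse yield described in this section can be sketched as two small functions. The names `spectral_snr` and `pulse_yield` are hypothetical, and treating the "PR frequency bin" as a single PSD bin is an assumption about how the ratio is formed.

```python
import numpy as np

def spectral_snr(psd, peak_index):
    """Power in the PR bin divided by the power in all remaining bins."""
    psd = np.asarray(psd, dtype=float)
    peak_power = psd[peak_index]
    rest_power = psd.sum() - peak_power
    return float(peak_power / rest_power) if rest_power > 0 else float("inf")

def pulse_yield(pr_bpm, duration_s, n_found):
    """Ratio of detected pulses to the number expected at the computed PR."""
    n_pred = pr_bpm * duration_s / 60.0
    return n_found / n_pred

# Worked example: at 60 bpm a 30 s recording should contain 30 pulses,
# so finding 24 of them corresponds to a pulse yield of 0.8.
example_py = pulse_yield(pr_bpm=60.0, duration_s=30.0, n_found=24)
example_snr = spectral_snr([1.0, 1.0, 6.0, 1.0, 1.0], peak_index=2)  # 6 / 4
```

Unlike an SD-based index, the pulse yield cannot be inflated by a recording in which only a handful of regular-looking intervals were found, because missing pulses lower the numerator directly.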

3.1. Datasets

We use two datasets to evaluate the accuracy of the PR algorithm. The first dataset was collected during the VISION-MD study (Wiffen et al 2023). Participants took part in data collection during outpatient hospital appointments. During data collection, up to three high-quality, non-compressed videos of each participant’s face were recorded to extract rPPG signals as described in section 2.1. Before each video recording, the participant’s PR was measured using a pulse oximeter finger clip. The participant’s age, sex, height and skin tone based on the Fitzpatrick scale were also recorded. After data cleaning, 776 subjects with 2197 measurements remained. Of these, 47% are male and 53% are female. The youngest participant is 18 years old and the oldest is 89 years old, with a mean age of 56 (±16.5) years. In this dataset, skin tone was recorded by the research nurses based on visual assessment. The second dataset was collected in collaboration with a clinical laboratory and consists of 85 participants with three measurements each, giving 255 measurements in total. In this study, the ground truth PR was captured using a pulse oximeter, as in the VISION-MD study, but also using time-aligned ECG measurements. Because PR is not a static vital sign, comparing the PR algorithm against a single spot measurement is likely less accurate than comparing it against the PR measured via ECG over the exact same timeframe. Of the participants in this dataset, 33% are male and 67% are female. The youngest participant is 20 years old and the oldest is 77 years old, with a mean age of 48 (±13.6) years. In this dataset, the melanin content was measured using a Mexameter MX 18 on the inside of participants’ arms to determine the skin tone. An overview of participant demographics, including age, sex, and skin tone for both datasets, is presented in figure 5.

Figure 5. Overview of the distribution of participant age, sex, and skin tone across the two datasets.

3.2. Performance metrics

3.2.1. Usability

Previously, the algorithm would only return a PR if the $\mathrm{SQI}$ was more than 0.8. In the proposed algorithm, which uses the frequency domain, the $\mathrm{PY}$ has to be more than 0.5, meaning that at least half the signal contains usable pulsatile information, and the SNR has to be more than 0.15 (i.e. −8.2 dB when expressed in decibels). Requiring more than half of the signal to contain detected pulses ( $\mathrm{PY} \geq 0.5$ ) mitigates frequency-doubling at lower PR by favouring the modal (fundamental) component over accidentally marking the diastolic peak as a second pulse. The SNR cut-off was selected from SNR-error scatterplots, where values below $0.15$ were associated with markedly higher and more variable errors. While understanding and comparing the accuracy of both algorithms is important, it is also essential to assess the usability of the PR algorithms. To be used in medical settings, the algorithm must be robust, reliable, and capable of calculating PR even in poor signal quality conditions and across the full range of skin tones. To evaluate its performance in terms of usability, we measure the percentage of successful measurements in the VISION-MD study based on the different signal quality metrics described in section 2.3.3.
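A small sketch of the quality gate and of the decibel conversion quoted above follows. The function names are hypothetical, and treating the thresholds as inclusive is an assumption, since the text leaves open whether the comparisons are strict.

```python
import math

def snr_to_db(snr_ratio):
    """Convert a power ratio to decibels."""
    return 10.0 * math.log10(snr_ratio)

def is_usable(pulse_yield, snr_ratio, py_min=0.5, snr_min=0.15):
    """Return True when both quality criteria of the combined algorithm hold.

    Inclusive comparisons are an assumption made for this sketch.
    """
    return pulse_yield >= py_min and snr_ratio >= snr_min

snr_cutoff_db = snr_to_db(0.15)                       # about -8.2 dB
accepted = is_usable(pulse_yield=0.82, snr_ratio=0.40)
rejected = is_usable(pulse_yield=0.82, snr_ratio=0.05)  # fails on SNR alone
```

Both criteria must hold, so a recording with many detected pulses but a weak spectral peak is still rejected.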

3.2.2. Accuracy

The required accuracy for a PR monitor is a root mean square error (RMSE) within $\pm 5$ bpm or ±10%, whichever is greater, following the specifications from ANSI (Association for the Advancement of Medical Instrumentation et al 2002). In addition to RMSE, we consider the mean error (ME), the mean absolute error (MAE) and the SD of the error. As well as assessing accuracy across the full dataset, we tested for group differences by skin tone using a one-way ANOVA on per-measurement MAE and ME. We also examine accuracy under the different signal quality metrics to compare the old and new signal quality criteria.
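The accuracy metrics and the group comparison can be sketched as below. The helper names and the toy numbers are illustrative only and are not study data; `scipy.stats.f_oneway` is used for the one-way ANOVA, and the use of the sample SD (`ddof=1`) is an assumption.

```python
import numpy as np
from scipy.stats import f_oneway

def accuracy_metrics(estimated, reference):
    """ME, MAE, SD and RMSE of the per-measurement error in bpm."""
    error = np.asarray(estimated, dtype=float) - np.asarray(reference, dtype=float)
    return {
        "ME": float(error.mean()),
        "MAE": float(np.abs(error).mean()),
        "SD": float(error.std(ddof=1)),
        "RMSE": float(np.sqrt(np.mean(error ** 2))),
    }

def anova_by_group(values, groups):
    """One-way ANOVA of per-measurement values across group labels."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    samples = [values[groups == g] for g in np.unique(groups)]
    result = f_oneway(*samples)
    return float(result.statistic), float(result.pvalue)

# Toy data (not from the study): six measurements in three skin tone groups.
estimated = [62, 58, 71, 80, 66, 75]
reference = [60, 60, 70, 82, 65, 75]
tone_group = ["1-2", "1-2", "3-4", "3-4", "5-6", "5-6"]
metrics = accuracy_metrics(estimated, reference)
abs_error = np.abs(np.subtract(estimated, reference))
f_stat, p_value = anova_by_group(abs_error, tone_group)
```

Running the ANOVA on the signed error tests for a group difference in bias, while running it on the absolute error tests for a difference in error magnitude, which is why both are reported.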

4.1. Usability

In tables 1 and 2, an overview of the success rate is given for the two algorithms. In our pipeline, a PR is returned only when quality thresholds are met (i.e. $\mathrm{PY} \geq 0.5$ and $\mathrm{SNR} \geq 0.15$ for the proposed algorithm, and $\mathrm{SQI} \geq 0.8$ for the time domain only algorithm). Usability therefore reflects the percentage of signals that meet these criteria. It can be seen that the proposed algorithm is much more usable across the full dataset, but it remains more challenging in participants with darker skin tones. There is no significant difference in success rate between male and female participants. Darker skin tones are more challenging because visible-light rPPG relies on subtle colour modulations that are weaker when melanin absorption is higher. Under the same lighting, the pulsatile contrast is reduced, so fewer signals meet the $\mathrm{PY}$ / $\mathrm{SNR}$ criteria. Other reasons for failing to meet the quality thresholds include motion or pose changes during measurement. Since the mid-face area used for signal calculation is fixed once acquisition starts, movement can shift this region away from skin or introduce noise from facial edges. Lighting changes can also degrade the signal, as low intensity or strongly varying illumination reduces the contrast between the pulsatile component and background noise.

Table 1. Successful measurements categorised by sex.

| Successful measurements by sex | All (%) | Female (%) | Male (%) |
| --- | --- | --- | --- |
| Time domain only | 42.3 | 39.9 | 45.0 |
| Frequency and time domain combined | 83.7 | 82.6 | 84.9 |

Table 2. Successful measurements categorised by skin tone on the Fitzpatrick scale.

| Successful measurements by skin tone | 1 & 2 (%) | 3 & 4 (%) | 5 & 6 (%) |
| --- | --- | --- | --- |
| Time domain only | 45.9 | 42.1 | 22.9 |
| Frequency and time domain combined | 89.2 | 85.1 | 49.2 |

4.2. Accuracy in VISION-MD

In figure 6 scatter plots demonstrating the performance of both algorithms are shown. It can be seen that using the frequency method to find the seed value for the time domain interval detection algorithm has a better accuracy with a lower RMSE of 3.2 bpm compared to the previous result of 11.1 bpm. The scatterplot clearly shows the time domain only algorithm was susceptible to overestimating lower PRs significantly. RMSE is a useful metric, but a small number of large outliers can severely affect the RMSE. For completeness, MAE, ME and SD for both of the algorithms are shown in table 3. It can be seen that the proposed algorithm outperforms the previous version on all metrics.

Figure 6. Accuracy comparison between time domain only, and proposed, frequency and time domain combined, PR algorithm.


Table 3. Accuracy metrics for time domain only, and proposed, frequency and time domain combined, PR algorithm.

| | ME (bpm) | MAE (bpm) | SD (bpm) |
| --- | --- | --- | --- |
| Time domain only | −3.3 | 4.7 | 10.6 |
| Frequency and time domain combined | −1.0 | 2.3 | 3.1 |
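As a consistency check on the values in table 3 (a sketch that assumes the SD is the standard deviation of the per-measurement error, for which the identity holds exactly in its population form), the three quantities are linked by

$$\mathrm{RMSE}^2 = \mathrm{ME}^2 + \mathrm{SD}^2.$$

For the combined algorithm this gives $\sqrt{(-1.0)^2 + 3.1^2} \approx 3.3$ bpm, and for the time domain only algorithm $\sqrt{(-3.3)^2 + 10.6^2} \approx 11.1$ bpm. Both are in line with the reported RMSE values of 3.2 and 11.1 bpm, given that ME and SD are rounded to one decimal place.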

While an improved accuracy is a significant achievement, it is also important for a medical device to achieve equal performance in all patient groups. For rPPG, one challenging factor can be skin tone, because the increased light absorption by higher levels of melanin in the skin affects the amplitude of the pulsatile signal (Fallow et al 2013). Figure 7 shows accuracy across skin-tone groups. A one-way ANOVA found no effect of skin tone on either ME or MAE: ME, $F(2,1834) = 0.725,\, p = 0.485$ ; MAE, $F(2,1834) = 1.184,\, p = 0.306$ . To ensure completeness, ME, MAE and SD for the different skin tone groups are shown in table 4.

Figure 7. Accuracy of the proposed, frequency and time domain combined, PR algorithm grouped by skin tone.


PHYSIOLOGICAL MEASUREMENT
