Protecting sensitive data from bots

A self-hosted solution using ALTCHA, PHP, and JavaScript

In this post, we will cover how to prevent malicious bots from scraping sensitive information from your website. For the sake of privacy, we will implement a self-hosted solution using ALTCHA, PHP, and JavaScript.

Problem and motivation

Bots on the internet automatically scrape websites for information. While some scraping is legitimate, such as search engine indexing, malicious bots look for private data like names, email addresses, and phone numbers. This information could then be sold to and used by scammers or identity thieves. Therefore, you may want to protect yourself from such bots and provide your sensitive data only to legitimate human users.

There are a number of solutions for bot protection. For example, CAPTCHAs ask users to decipher and enter numbers and letters from an image. However, advances in machine learning make it possible to solve CAPTCHAs automatically. Proof-of-work methods, on the other hand, require the client to solve a computationally expensive problem, which takes some processing time, before the solution is sent back to the server for validation. For a single human visitor, this delay is usually acceptable, but it makes solving the problem impractical for bots, which aim to scrape thousands of pages quickly and in parallel. Thus, a bot would either not execute the code needed to solve the problem or abort it, and your sensitive data would not be revealed.
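To make the proof-of-work idea concrete, here is a minimal Node.js sketch. It assumes a simplified scheme in which the server publishes a random salt together with the SHA-256 hash of the salt followed by a secret number, and the client has to find that number by trial and error. The function names (createChallenge, solveChallenge) and the parameter values are illustrative assumptions; this is not ALTCHA's actual implementation.

```javascript
const crypto = require('node:crypto');

// Hex-encoded SHA-256 digest of a string.
function sha256Hex(text) {
  return crypto.createHash('sha256').update(text).digest('hex');
}

// Server side (illustrative): hide a random number behind a salted hash.
function createChallenge(maxNumber) {
  const salt = crypto.randomBytes(12).toString('hex');
  const secretNumber = crypto.randomInt(0, maxNumber + 1); // 0..maxNumber
  return { salt, maxNumber, challenge: sha256Hex(salt + secretNumber) };
}

// Client side: try every candidate until the hash matches.
// The expected cost grows linearly with maxNumber.
function solveChallenge({ salt, maxNumber, challenge }) {
  for (let n = 0; n <= maxNumber; n++) {
    if (sha256Hex(salt + n) === challenge) {
      return n;
    }
  }
  return null; // no solution within the allowed range
}

const challenge = createChallenge(50000);
const solution = solveChallenge(challenge);
console.log('solved with number', solution);
```

Checking a submitted number costs the server a single hash, while finding it costs the client up to maxNumber hashes. That asymmetry is what makes scraping many pages in parallel expensive.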

ALTCHA bot protection

ALTCHA is open-source software for bot protection using the proof-of-work principle. It is privacy-friendly and can be self-hosted on your own server. We will now look at how to implement it in a minimal example. You can easily adapt this example to your website in order to protect your sensitive data from web-scraping bots.

Dependency installation

First, we need to install the ALTCHA dependencies. We assume Composer and NPM are installed. Running the following commands in your terminal from an empty directory will set up our working environment.

composer require altcha-org/altcha
npm install altcha

Base HTML and JavaScript

In our example, we create a very minimal web page, where the sensitive data will be shown to humans but not to bots. First, ALTCHA’s JavaScript components are included (line 5). Then, in the body, we present the ALTCHA widget as a form, where secret.php handles the ALTCHA challenge and validation (lines 12–15). We wrap the form inside an identifiable div block, which is later dynamically replaced by the hidden information (lines 10, 16). Finally, the JavaScript code at the bottom (lines 18–34) calls the validation PHP code (see next section) and, after successful validation, replaces the ALTCHA widget with the sensitive information (line 31).

<!DOCTYPE html>
<html>
<head>

<script async defer type="module" src="node_modules/altcha/dist/altcha.js"></script>

</head>
<body>

<div id="replaceThis">
<p>Please confirm that you're not a bot.</p>
<form id="myForm" method="POST" action="secret.php" enctype="multipart/form-data">
<altcha-widget challengeurl="secret.php"></altcha-widget>
<button type="submit">I'm human!</button>
</form>
</div>

<script>
/* Sending the formData object as payload using Fetch */
const form = document.getElementById('myForm');
const repl = document.getElementById('replaceThis');

form.addEventListener('submit', function(e) {
    // Prevent default behavior:
    e.preventDefault();
    // Create payload as new FormData object:
    const payload = new FormData(form);
    // Post the payload using Fetch:
    fetch(form.action, { method: 'POST', body: payload, })
    .then(resp => { console.log(resp); return resp.json(); })
    .then(data => { console.log(data); repl.innerHTML = data['html']; })
    .catch((error) => console.error('Error: ', error));
});
</script>

</body>
</html>

Save our HTML base to a file named index.html in our working directory.
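The submit handler above assumes that every reply from the server is JSON with an html field. As a hedged sketch, the reply handling could be factored into a small function that falls back to a message when the body is empty or malformed, for instance when validation fails. The function name htmlFromReply, the fallback text, and the check on a success field are assumptions for illustration.

```javascript
// Fallback markup shown when the reply cannot be used (illustrative text).
const FALLBACK_HTML = '<p>Validation failed. Please reload the page and try again.</p>';

// Decide what to display from the HTTP status and raw body of a reply.
function htmlFromReply(status, bodyText) {
  const okStatus = status >= 200 && status < 300;
  if (!okStatus || bodyText.trim() === '') {
    return FALLBACK_HTML;
  }
  try {
    const data = JSON.parse(bodyText);
    // Assumes the server answers with { success: true, html: "..." }.
    if (data !== null && data.success === true && typeof data.html === 'string') {
      return data.html;
    }
  } catch (error) {
    // The body was not valid JSON; fall through to the fallback.
  }
  return FALLBACK_HTML;
}

console.log(htmlFromReply(200, '{"success":true,"html":"<p>secret</p>"}'));
console.log(htmlFromReply(200, ''));

// In the browser, the handler could read the body as text first:
//   fetch(form.action, { method: 'POST', body: payload })
//     .then(resp => resp.text().then(text => htmlFromReply(resp.status, text)))
//     .then(html => { repl.innerHTML = html; });
```

Reading the body as text before parsing keeps an empty reply from surfacing as an unhandled JSON error, so the visitor sees a message instead of an unchanged page.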

ALTCHA backend using its PHP library

The ALTCHA PHP library is well documented and straightforward to use. Our PHP script first prepares the ALTCHA challenge (lines 9–15). Make sure to change the placeholder to a truly secret key (line 9). On a GET request, it presents the challenge to the client (lines 18–22), and on a POST request, it validates the client’s solution (lines 30–44). After successful validation, our sensitive information (line 26) is returned to the client (lines 40–43).

<?php

require 'vendor/autoload.php';

use AltchaOrg\Altcha\ChallengeOptions;
use AltchaOrg\Altcha\Altcha;

// initialize ALTCHA with custom secret key
$altcha = new Altcha("SecretHMACKey");

// create a new challenge
$options = new ChallengeOptions(
    maxNumber: 50000, // the maximum random number
    expires: (new \DateTimeImmutable())->add(new \DateInterval('PT10S')),
);

// send the challenge to be solved on GET request
if ( $_SERVER['REQUEST_METHOD'] === 'GET') {
    $challenge = $altcha->createChallenge($options);
    header('Content-Type: application/json; charset=utf-8');
    echo json_encode($challenge);
}

// store protected information as HTML code here
$myHtmlOutput = <<<HTML
<p>This information is protected from bots, shown only after validation.</p>
HTML;

// and process passed challenge solution on POST request
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    // decode sent solution
    $payload = $_POST['altcha'] ?? '';
    $decodedPayload = base64_decode($payload);
    $payload = json_decode($decodedPayload, true);

    // verify the solution
    $ok = $altcha->verifySolution($payload, true);

    // if solution is valid, then provide the protected information
    if ($ok) {
       header('Content-Type: application/json; charset=utf-8');
       echo json_encode(['success' => true, 'html' => $myHtmlOutput]);
    }
}

Save the above code in a file named secret.php in the same directory as the index.html file.
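The secret key matters because a server using this kind of scheme does not need to store the challenges it issues: it can sign each challenge with an HMAC and later recompute the signature to check that a submitted solution belongs to a challenge it created. The following Node.js sketch illustrates that idea under simplifying assumptions. The field names, the helpers signChallenge and checkSolution, and the way the expiry time is embedded in the salt are illustrative and do not reproduce the ALTCHA library's internals.

```javascript
const crypto = require('node:crypto');

const sha256Hex = (text) =>
  crypto.createHash('sha256').update(text).digest('hex');

const hmacHex = (key, text) =>
  crypto.createHmac('sha256', key).update(text).digest('hex');

// Issue a challenge whose salt carries its own expiry time (illustrative format).
function signChallenge(key, secretNumber, expiresAtMs) {
  const salt = crypto.randomBytes(12).toString('hex') + '.' + expiresAtMs;
  const challenge = sha256Hex(salt + secretNumber);
  return { salt, challenge, signature: hmacHex(key, challenge) };
}

// Accept a solution only if it is unexpired, correctly solved, and signed with our key.
function checkSolution(key, { salt, challenge, signature, number }, nowMs = Date.now()) {
  const expiresAtMs = Number(String(salt).split('.')[1]);
  if (!Number.isFinite(expiresAtMs) || nowMs > expiresAtMs) return false;
  if (sha256Hex(salt + number) !== challenge) return false;
  const expected = Buffer.from(hmacHex(key, challenge), 'hex');
  const received = Buffer.from(String(signature), 'hex');
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return expected.length === received.length &&
    crypto.timingSafeEqual(expected, received);
}

const key = 'SecretHMACKey'; // placeholder, as in secret.php
const issued = signChallenge(key, 1234, Date.now() + 10_000);
console.log(checkSolution(key, { ...issued, number: 1234 }));        // true
console.log(checkSolution('WrongKey', { ...issued, number: 1234 })); // false
```

Anyone who knows the key can forge signed challenges with trivial solutions, which is why the placeholder should be replaced with a long random value that never reaches the client.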

Testing the validation

To test our code, start a local PHP web server from our working directory:

php -S localhost:8000

Then open your web browser and navigate to http://localhost:8000.

The website and ALTCHA widget should now behave as shown in the following animation.

Hopefully, these short code snippets will help you fend off malicious scraping bots and protect the information on your website.