Have you ever wondered how to easily download a full website to PDF or JPG? Of course, you can use the “Print” functionality of your browser and then choose the “Save as PDF” option, or simply take a screenshot. The situation gets a little more complicated if the website is very long or if you need to download multiple websites/pages at once. Today we want to show you how we do it!
We will be using the Linux shell and Bash to easily accomplish that goal.
How to save a website to PDF – approach 1
Let us download 10 pages with incrementing numbers. For that purpose we will use the wkhtmltopdf tool, which you can install by running the following command on Debian-based Linux systems:
sudo apt install wkhtmltopdf
After that you can create a file by running:
touch download.sh
and then
chmod +x download.sh
to make the file executable. Now open the file in your favorite editor and paste the following lines.
#!/bin/bash
# Download pages 1 to 10 and save each one as a separate A4 PDF
for i in {1..10}
do
  # Replace this with the address of the pages you want to download
  url="https://someurl.com/page_$i.xhtml"
  wkhtmltopdf -s A4 --disable-smart-shrinking --zoom 1.0 "$url" "output_file_$i.pdf"
done
Now you can save the file and then run it by executing:
./download.sh or bash download.sh (the script uses Bash-specific features such as brace expansion, so run it with bash rather than plain sh)
Now let us explain:
- #!/bin/bash is the opening line (shebang) of a Bash script
- next, we start a loop which runs 10 times, each time setting the $i variable to the current number.
- we set a variable called “url”, which you will have to replace with the address of the pages that you want to download.
- and finally we download each page to a PDF file; the files will be named output_file_1.pdf, output_file_2.pdf and so on.
There is also a --zoom parameter which you can increase if the PDF printout is not taking up the entire A4 page.
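For example, if the output looks too small on the page, you could test a higher zoom value on a single page before running the whole loop (the URL below is just a placeholder):
wkhtmltopdf -s A4 --disable-smart-shrinking --zoom 1.3 https://someurl.com/page_1.xhtml zoom_test.pdf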
How to save a website to PDF – approach 2
Sometimes we experience issues with the wkhtmltopdf tool: for some websites we simply get a blank page as a result. The best solution that we came up with was to use the CutyCapt tool. Here is how you can do it:
#!/bin/bash
# Download pages 1 to 10 with CutyCapt; the output format follows the --out file extension
for i in {1..10}
do
  url="https://someurl.com/page_$i.xhtml"
  cutycapt --min-width=1024 --min-height=1280 --zoom-factor=1.0 --url="$url" --out="output_page_$i.pdf"
done
You might need to install CutyCapt first; on Debian-based systems you can do it by running:
sudo apt install cutycapt
As you can see, the cutycapt command is very similar to wkhtmltopdf; it also has a --zoom-factor parameter which you can increase to fill the entire page.
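As an example, a single test capture with a larger zoom factor could look like this (again, the URL is only a placeholder):
cutycapt --min-width=1024 --min-height=1280 --zoom-factor=1.5 --url=https://someurl.com/page_1.xhtml --out=zoom_test.pdf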
How to save a website to JPG
This is very similar to approach number 1: you only need to replace the “wkhtmltopdf” command with “wkhtmltoimage”. On Debian-based systems wkhtmltoimage is shipped as part of the wkhtmltopdf package, so if you followed approach 1 it should already be installed:
sudo apt install wkhtmltopdf
Then create a file, make it executable as before, and paste the following script:
#!/bin/bash
# Download pages 1 to 10 and save each one as a JPG image
for i in {1..10}
do
  url="https://someurl.com/page_$i.xhtml"
  wkhtmltoimage --width 900 --height 1280 --zoom 1.6 "$url" "output_file_$i.jpg"
done
Similarly, we are downloading 10 pages, but this time the output is saved as JPG files. We can control the image size by adjusting the --width and --height parameters, and use --zoom to scale the content.
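If you are not sure which dimensions fit your pages, you could try a single test image first (placeholder URL again):
wkhtmltoimage --width 1200 --height 1600 --zoom 1.6 https://someurl.com/page_1.xhtml size_test.jpg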
The title of the post was “How to save a full website to PDF”, so you may be wondering now: what do I do with multiple PDF files?
We have an answer to that too!
Just open your terminal in the folder where you downloaded the pages and run the following command:
pdfunite $(ls -v *.pdf) output.pdf
pdfunite is the tool we use to join multiple PDF files into one. Thanks to ls -v, the files will be joined together in the right numeric order 🙂
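If pdfunite is not installed yet, on Debian-based systems it is part of the poppler-utils package:
sudo apt install poppler-utils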
That’s it for now! If you have any questions, please leave a comment below.