Downloading Syosetu Novels

Sep 26, 2021

This article explains how to download web novels from syosetu.com for offline viewing using wget.

But don't even bother with wget! Use my ready-made tool, mkshousetsu, instead. It'll download the entire novel and convert it to an EPUB.

Or just download it as a PDF. That's easier than what's below.


Prerequisites

  1. A stable internet connection
  2. wget and relevant dependencies
  3. Patience

Process

Find a web novel you want to download. Let's use https://ncode.syosetu.com/n8792em/ as an example.

Open a terminal, navigate to the folder you want to download the novel in, and execute this command:

wget -w 1 -np --page-requisites -r -l 1 -E -H -k https://ncode.syosetu.com/n8792em/ -e robots=off --continue

Not all of these options may be necessary, but most of them are. In order, this is what they do:

  • -w waits the specified number of seconds between requests. This is important because Syosetu (and most other sites) will block wget if it requests pages too quickly. It also reduces the strain on the server, so it's polite.
  • -np applies only when the -r option is used. It's short for "--no-parent", which means that wget won't crawl up the directory tree to download pages or resources; it only crawls down the tree. This cuts down massively on the number of pages and resources wget downloads and significantly reduces the complexity of the resulting directory tree. Unless you want everything on a website (you don't in this instance), use this.
  • --page-requisites downloads the resources a page needs to function/display properly. Without it, the page will be unstyled except for inline styles. We don't need any of the JavaScript, but I don't know how to get rid of it, short of deleting the resources every time, so I just block those resources.
  • -r downloads pages recursively. Without this, only the page you specify (the index page) would be downloaded.
  • -l takes an integer and limits how many levels deep wget recurses. I've set this to 1 so that it only crawls one level down from the index page.
  • -E ensures that documents of type text/html and text/css are affixed with a .html or .css extension respectively. This is helpful for websites that have pages that end with .php, .asp, etc. This isn't helpful for Syosetu, but it's a good default option.
  • -H downloads pages/resources from websites that are not a part of the domain you specify. This is helpful for sites like Syosetu which rely on a CDN.
  • -k is a post-download option that rewrites all of the links so that the downloaded files reference each other locally if they exist. It also ensures that .html files pull CSS/JS resources locally instead of from a remote website. This is very important. I can't think of a scenario where this wouldn't apply.
  • -e robots=off tells wget not to listen to robots.txt. By default, wget will listen to robots.txt, which may limit the pages it crawls. If wget doesn't download all of the necessary pages you need, this is a helpful option to use. The -e option tells wget that what follows is a command that would normally be in .wgetrc, in this case robots=off.
  • --continue is helpful for resuming partial downloads that were interrupted. This saves bandwidth, but not necessarily time, as wget still walks through all of the pages/resources.
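Since -e just passes a command that would normally live in .wgetrc, the whole option set above can be kept in a wgetrc file instead, so the command line shrinks to wget plus the URL. This is a sketch from memory of the corresponding wgetrc command names; double-check them against your wget's manual before relying on it:

```
# ~/.wgetrc equivalent of the command-line options above (sketch)
wait = 1                # -w 1
no_parent = on          # -np
page_requisites = on    # --page-requisites
recursive = on          # -r
reclevel = 1            # -l 1
adjust_extension = on   # -E
span_hosts = on         # -H
convert_links = on      # -k
robots = off            # -e robots=off
continue = on           # --continue
```

Note that a global .wgetrc affects every wget run, so options like span_hosts and robots = off are better left on the command line unless you always want them.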

Syosetu may or may not be protected by Cloudflare. As a result, it may block wget because it isn't authenticating with cookies. To get around this, create a new Firefox profile, download the cookies.txt extension, and then navigate to the web novel page in your browser. Extract the cookies with cookies.txt and save them in the same folder or the folder above where you're downloading the web novel.
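If you want to sanity-check the exported file before handing it to wget, --load-cookies expects the Netscape cookies.txt format. The snippet below builds an illustrative file by hand (the domain, expiry, and cookie name/value are made up; your real file is whatever cookies.txt exported):

```shell
# Netscape cookie format: one cookie per line, tab-separated fields:
# domain, include-subdomains flag, path, secure flag, expiry (epoch), name, value.
# printf is used so the tabs are guaranteed to be real tab characters.
printf '# Netscape HTTP Cookie File\n' > syosetu-cookies-file.txt
printf '.syosetu.com\tTRUE\t/\tFALSE\t1767225600\texample_cookie\texample_value\n' >> syosetu-cookies-file.txt

# Quick sanity check: the header comment should be the first line.
head -n 2 syosetu-cookies-file.txt
```

If the exported file doesn't start with that header or the fields aren't tab-separated, wget will silently ignore it and you'll keep getting blocked.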

Then, use this command:

wget -w 1 -np --page-requisites --load-cookies syosetu-cookies-file.txt -r -l 1 -E -H -k https://ncode.syosetu.com/n8792em/ -e robots=off --continue

Regardless of which command you need to use, this will probably take 15 minutes or so, so be patient.

After the download finishes, the resulting directory tree is complex and its directory names are semantically useless, so I recommend creating a symbolic link to the novel in the parent directory with a helpful name. This makes it much easier to keep track of future web novels you download.
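For example, with the novel from earlier (the directory names mirror the example URL; a stand-in tree is created here for illustration, since yours comes from wget):

```shell
# wget leaves a tree named after the host and the novel's ncode.
# Recreate a stand-in of that layout for illustration:
mkdir -p ncode.syosetu.com/n8792em

# Give it a memorable name in the parent directory:
ln -s ncode.syosetu.com/n8792em my-novel

# The novel is now reachable through the friendly name:
ls my-novel
```

Because the link is relative, it keeps working as long as you don't move it out of this directory; use an absolute target if you want to link to it from elsewhere.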

Lastly, if you can't stand the light theme, and you also can't stand Javascript, read through this article to learn how to create and use Firefox's userContent.css file to fix that.