
Fanfic.net Purge

Nihlus Cain

Hello everyone, I do not know if you are aware, but fanfic.net has instituted a purge of the more smutty stories on the site, and I do not know if that is the end of it.
I suggest you save and download your favourite stories before they get deleted forever.
 
Hello everyone, I do not know if you are aware, but fanfic.net has instituted a purge of the more smutty stories on the site, and I do not know if that is the end of it.
I suggest you save and download your favourite stories before they get deleted forever.
…I don't know a language or a meme that can accurately express my emotions on that
 
I had a feeling this day was coming. Fanfiction.net has always been against smut, even if they haven't ever actually enforced it. I had hoped that with how dead the website was, the owners either wouldn't care or wouldn't want to risk alienating the small audience they still have. I wonder how many jannies they have available and how far they actually intend to take this purge. You have a website with easily tens of thousands of stories; unless you get an army, or a group of people willing to spend ten hours a day for a few weeks cleaning it up, you're not going to scratch the surface of the amount of smut on that website.
 
So, which stories and authors were deleted/banned?
 
I had a feeling this day was coming. Fanfiction.net has always been against smut, even if they haven't ever actually enforced it. I had hoped that with how dead the website was, the owners either wouldn't care or wouldn't want to risk alienating the small audience they still have. I wonder how many jannies they have available and how far they actually intend to take this purge. You have a website with easily tens of thousands of stories; unless you get an army, or a group of people willing to spend ten hours a day for a few weeks cleaning it up, you're not going to scratch the surface of the amount of smut on that website.
IIRC it's literally one dude with a shitload of automated help, that's why stuff builds up and then we get purges.
 
If anyone is looking for a story that is missing and you have the name and author, or alternatively the story ID, I have a scrape of the database that is a couple of years old. About 300 GB of text files. Ping me, and I'll see if it's in my database.

Download stories how?

The stupidly intense Cloudflare nonsense has pretty much dodoed all the third party downloaders.

I had some luck with, uh, I think it was a Docker image that coupled a scraper with an auto-captcha solver?

I'm not sure if it'd work anymore though. Captchas have changed quite a bit over the last few months. Next time I need to download stories, I might need to cycle through IPs to scrape the few I want from the database.
 
That's incredibly based.

I'm surprised the size is so small.

17 GB of text files is quite a lot of content.

That's tens of thousands of novels' worth.

I heavily suspect that's because the dump is incomplete.

So on 2015-10-25 you had this dude dump the initial database as a 106 GB .tar. That wasn't very useful for people who aren't familiar with massive data applications, so on 2016-03-29 he repacked it into .zip and .rar and organized the filenames along with adding an .sqlite directory. This edition is still well over 100 GB when it comes to archives alone.

But there were some problems. Some categories were missing because of how everything was sorted, so on 2019-03-04, he uploaded a corrected version with these missing categories added. This archive is still 113 GB compressed.

But you're looking at oh, a good four years since the archive was scraped, which IIRC, took at least two years for all ten million files. Plenty of time for things to be updated, which the archive misses. So I think the scraper rescraped all the updated stories and threw them up in a single torrent for people to grab. It's what I would do, and it's the smart way to do things.

BTW, I have both the updateablefanfic archive and the 2019 redux. I haven't unpacked the updateablefanfic archive, but I can tell you the master redux is only 113 GB if you leave it in the archives. Take 'em out, and you're looking at 300 GB of text files.

How far back does it go? There's one TTGL fanfic by Necromander I never got around to reading before he deleted it in the early 2010s.

Probably doesn't have it. IIRC, the dude started scraping fanfic.net in 2013, so he wouldn't've managed to grab the story as it was already deleted.

EDIT: FWIW, these file sizes are more or less what I would expect. I recently got a pretty recent scrape of the AO3 database, and the sizes you're dealing with here are similar. The AO3 dump is 539 GB, and took me about two months to download due to the Internet Archive throttling my IP.

EDIT 2: Oh, and the updateablefanfic isn't 17 GB. The Internet Archive breaks big downloads into chunks if you're doing a DD. The updateablefanfic is 56 GB, which I'd say would probably be another 150 GB if the scraper uses his previous compression schemes.
 
I heavily suspect that's because the dump is incomplete.
Makes sense, I suspected as much.

But there were some problems. Some categories were missing because of how everything was sorted, so on 2019-03-04, he uploaded a corrected version with these missing categories added. This archive is still 113 GB compressed.

But you're looking at oh, a good four years since the archive was scraped, which IIRC, took at least two years for all ten million files. Plenty of time for things to be updated, which the archive misses. So I think the scraper rescraped all the updated stories and threw them up in a single torrent for people to grab. It's what I would do, and it's the smart way to do things.

BTW, I have both the updateablefanfic archive and the 2019 redux. I haven't unpacked the updateablefanfic archive, but I can tell you the master redux is only 113 GB if you leave it in the archives. Take 'em out, and you're looking at 300 GB of text files.
Ah, so here's to hoping the file/fic names still match so I can reconstruct an up-to-date dataset and then shove that into an SQLite FTS5 table.
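
For anyone wanting to try the same thing, here is a minimal sketch of the "merge, then index with FTS5" idea using Python's built-in sqlite3 module. Everything specific in it is an assumption for illustration: the `(story_id, title, author, body)` row layout, the `fics` table name, and the rule that the newer dump wins on a matching ID are not taken from the actual archives, and it only works if your SQLite build has FTS5 compiled in.

```python
import sqlite3


def merge_dumps(old_rows, new_rows):
    """Merge two dumps keyed on story_id; rows from the newer dump win.

    Rows are hypothetical (story_id, title, author, body) tuples.
    """
    merged = {row[0]: row for row in old_rows}
    merged.update({row[0]: row for row in new_rows})
    return list(merged.values())


def build_index(rows, path=":memory:"):
    """Create an FTS5 table and load the merged rows into it."""
    conn = sqlite3.connect(path)
    # story_id is stored but not tokenized, so it never matches a text query.
    conn.execute(
        "CREATE VIRTUAL TABLE fics "
        "USING fts5(story_id UNINDEXED, title, author, body)"
    )
    conn.executemany("INSERT INTO fics VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def search(conn, query):
    """Return (story_id, title) pairs for a full-text query, best match first."""
    cur = conn.execute(
        "SELECT story_id, title FROM fics WHERE fics MATCH ? ORDER BY rank",
        (query,),
    )
    return cur.fetchall()
```

Usage would look like `conn = build_index(merge_dumps(old_rows, new_rows))` followed by `search(conn, "drill")`. If the two dumps only share file names rather than IDs, the same dictionary merge works with the file name as the key.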

EDIT: FWIW, these file sizes are more or less what I would expect. I recently got a pretty recent scrape of the AO3 database, and the sizes you're dealing with here are similar. The AO3 dump is 539 GB, and took me about two months to download due to the Internet Archive throttling my IP.
Yeah, the throttling kinda sucks.

I've considered the torrent but... the privacy aspect of clearnet torrents always seems iffy. And Internet Archive's torrents are not-too-rarely broken anyway.

Would this be the Ao3 dump?

EDIT 2: Oh, and the updateablefanfic isn't 17 GB. The Internet Archive breaks big downloads into chunks if you're doing a DD. The updateablefanfic is 56 GB, which I'd say would probably be another 150 GB if the scraper uses his previous compression schemes.
Yeah, there's an initial 17GB .zip archive, and a bunch of .7z archives that extract to around 4x as much space as the archive itself.
 
Now I can't read some fanfics, but I can read a few others. Reviews are a no-show too.
 
Yeah, the throttling kinda sucks.

I've considered the torrent but... the privacy aspect of clearnet torrents always seems iffy. And Internet Archive's torrents are not-too-rarely broken anyway.

Would this be the Ao3 dump?

Yes. That would be the dump.

And FWIW, if you're worried about privacy, use a no-logs, no-PII-required VPN paid for in crypto. What I dunno is why you're even bothered. Like, maybe if your ISP is packet sniffing and inspecting your data traffic so they can throttle your torrents. But downloading datasets isn't something that's going to get the bobby knocking on your door. It's a big case of 'who cares', and at least for some of the fanfic archives, there's actually a small seedbase that is far faster than the Internet Archive's DD, in addition to reducing load on the IA's bandwidth.

Stuff like this is pretty much exactly what torrents are made for: public hosting and transfer of large files. My internet is pretty bad, but even if I throttle you at 100 kb/s, I'm still going to be feeding you files at a faster rate than the IA is going to allow you to get.

Topping all this off, as the files you are transferring are actually legal, there's no reason not to permaseed unless you're already permanently using that bandwidth for something. If you've bought it, and you aren't using it, that's wasted money.
 
And FWIW, if you're worried about privacy, use a no-logs, no-PII-required VPN paid for in crypto.
In truth, there's no verifying that any such thing exists (and even if they were to exist, netflow & other network analysis data useful for deanonymization without cooperation from ISPs is readily commercially available and usually sold for "threat analysis" or similar things) and obtaining privacy coins has gotten obnoxious with the KYC bullshit (it also represents an undesirable monetary hurdle to anonymity & privacy).

Also, some countries penalize even written lolicon/shotacon content (which an Ao3 dump is absolutely guaranteed to contain) like actual CP, and seeding it would be considered (re)distribution (yes it's stupid, no they don't care). It's not just copyright corposcum that makes the clearnet dangerous. Is it likely for one to get hit for that? Probably not until you annoy someone (doxing can get nasty), but it's still a risk.

I also never did say that the torrent protocol wasn't preferable, it is preferable, just not on the clearnet.
 
Also, some countries penalize even written lolicon/shotacon content (which an Ao3 dump is absolutely guaranteed to contain) like actual CP, and seeding it would be considered (re)distribution (yes it's stupid, no they don't care). It's not just copyright corposcum that makes the clearnet dangerous. Is it likely for one to get hit for that? Probably not until you annoy someone (doxing can get nasty), but it's still a risk.

If someone is at risk for this, then they shouldn't be seeding it.

In truth, there's no verifying that any such thing exists (and even if they were to exist, netflow & other network analysis data useful for deanonymization without cooperation from ISPs is readily commercially available and usually sold for "threat analysis" or similar things) and obtaining privacy coins has gotten obnoxious with the KYC bullshit (it also represents an undesirable monetary hurdle to anonymity & privacy).

As for this, look, I'm not stupid. A VPN isn't some magical cure-all. The point isn't anonymity. The point is privacy. I don't care if someone maybe knows I downloaded X, Y, or Z. I care about someone being able to prove it. I have VPNs I've paid for in Monero, and the email I put in was- well, I'm not going to tell you, but let's say it wasn't a standard Gmail address. What does this mean?

Well, they can't trace it back to the bank account. They can't trace it back to an email that they can prove I own. Theoretically, my VPN shouldn't be keeping logs. So who's to say that I'm really the one who was using that particular shared IP during the times they claim I did? Who's to say that I wasn't hacked? Who's to say someone from Russia wasn't bouncing their traffic through my router?

I don't care if they know and I know it. I care about being able to look people in the eye, lift my chin and say "Alright. Prove it."

And privacy coins. It's the exact same thing. I purchase my crypto with cash on a KYC exchange. Why? Because it's easy, and I don't give a damn if they know I bought X, Y, or Z. I don't need complete anonymity. I require privacy. Who cares if they know I bought X crypto? I care that when I'm done sloshing that crypto around after it was transferred into the throwaway address I keep specifically for KYC contamination for that particular exchange, no one but me and the person I gave crypto to knows where it's gone.

Unless you live somewhere where the government is literally going to give you a beating for so much as owning crypto, complete anonymity is a meme, and not really necessary. What you need is a reasonable doubt that means the difference between time in the slammer and life as a free man.

I also never did say that the torrent protocol wasn't preferable, it is preferable, just not on the clearnet.

Look, if you're at the point where a state-level agency is looking to track you the way you seem worried they are, then you need an entirely different threat model. If you don't have a second burner computer, you've already lost at this point. You just don't know it. Hell, if you're dealing with the kind of threat model where downloading a .txt archive is going to get you hit with CSAM charges, then you probably have no business talking about it here if you have any intention of downloading it.
 
complete anonymity is a meme, and not really necessary.
We're drifting off-topic, but nothing I suggested is complete anonymity. Even the far more onerous measures I could suggest would just be a lot harder to break, not impossible.

The set of those threatened by extremely-visible (not just to one's ISP(s), route & endpoint, but to various other parties, like torrent peers) upload in cleartext or cleartext-equivalent (RC4 encryption in torrents is worthless and many organizations have dedicated themselves to keeping tabs on public torrents) torrents on the clearnet is much larger than the set of those who have to worry that TLS-protected batch HTTP transfers will be broken or dangerous in some way.

But then that's subject to woeful throttling limitations, so perhaps just fixing torrents' privacy issues (which in this case requires anonymization to guarantee to a reasonable degree without unnecessary fees & trust) would be enough? And that's essentially what I suggested. It is indeed also not an option (or not a sufficient one anyway) in regimes where you essentially don't have a Right to Silence (and so encryption).

----

On a somewhat more on-topic note, the variation in export format between the dumps is a bit unfortunate. No one-size-fits-all approach possible here. Even chapter-detection will have to be per-origin-site.
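
As a rough illustration of what per-origin chapter detection could look like, here is a small sketch that keeps one heading regex per origin site and splits on whichever applies. The origin keys and both patterns are made up for the example; they are not the real heading formats of any of these dumps, so each one would need to be replaced after looking at the actual files.

```python
import re

# Hypothetical heading patterns, one per origin site. These are placeholders,
# not the formats the actual dumps use.
CHAPTER_PATTERNS = {
    "ffn": re.compile(r"^Chapter (\d+)\b.*$", re.MULTILINE),
    "ao3": re.compile(r"^=== (\d+)\. .*$", re.MULTILINE),
}


def split_chapters(text, origin):
    """Split a story into chapters using the pattern for its origin site.

    Returns the whole text as a single chapter when no heading matches.
    Anything before the first heading is discarded.
    Raises KeyError for an origin with no registered pattern.
    """
    pattern = CHAPTER_PATTERNS[origin]
    starts = [m.start() for m in pattern.finditer(text)]
    if not starts:
        return [text]
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
```

Keeping the patterns in a lookup table means a new dump format only needs a new entry, and the splitting logic stays the same for every site.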
 
