The full "Stressful PDF Corpus" download (also referred to as the "Issue Tracker Corpus") is available from this webpage as a set of 6 tar.gz (.tgz) files.
It is documented at the "Stressful PDF Corpus" resource on the PDF Association website and was publicly announced in Nov 2020 with the "Stress PDF corpus grows" article.
The BACKGROUND.txt file provides more detailed background information.
This collection of PDF files may contain malicious files. Beware!
For the Nov 2020 crawl, we increased the number of projects dramatically over the Feb 2020 crawl. Note that we did not re-crawl pdfium in the Nov 2020 crawl. The PDFs are batched by project as follows:
./pdfs_202011/batch1.tgz/ (SHA-512 hash)
./pdfs_202011/batch2.tgz/ (SHA-512 hash)
./pdfs_202011/batch3.tgz/ (SHA-512 hash)
./pdfs_202011/batch4.tgz/ (SHA-512 hash)
./pdfs_202011/batch5.tgz/ (SHA-512 hash)
./pdfs_202011/batch6.tgz/ (SHA-512 hash)
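Since each batch is published alongside a SHA-512 hash, it may be worth checking a downloaded archive against that hash before unpacking it. The following is a minimal Python sketch of such a check, not an official tool for this corpus. The function names `sha512_of_file` and `verify_batch` are hypothetical, and the format of the published hash file is an assumption: the sketch takes the first whitespace-separated token of the supplied text as the hex digest.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the SHA-512 hex digest of a file, reading it in chunks
    so that large archives do not need to fit in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_batch(archive_path, expected_text):
    """Compare an archive's SHA-512 digest with the published value.

    Assumption: expected_text is either a bare hex digest or a line of
    the form "<hex digest>  <filename>"; only the first token is used.
    """
    tokens = expected_text.split()
    if not tokens:
        return False  # nothing to compare against
    return sha512_of_file(archive_path) == tokens[0].lower()
```

A call such as `verify_batch("batch1.tgz", published_hash_text)` returns `True` only when the digests match, where the local file name and the `published_hash_text` variable stand in for whatever was actually downloaded. Given the warning above about malicious files, a matching hash confirms only that the download is intact, not that the contents are safe to open.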
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0079. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). Approved for public release.