Stressful PDF Corpus

These PDF files were gathered from several issue trackers. We used APIs for Bugzilla- and JIRA-based bug trackers, and did straight HTML scraping for GitHub-based trackers. We also wrote a fully custom crawler for the Chromium issue tracker, which hosts the pdfium issues.

For Bugzilla-based trackers, we ran queries for issues that contained an attachment with a MIME type including the word 'application'. This was a deliberately broad query. We then gathered all attachments from those issues.

After downloading the files, we identified compressed files (e.g. *.bz2) and package files (e.g. *.zip). We wrote code to extract/decompress those files (including the .tgz combination). The extracted files are named, for example, OOO-5860-10.zip-0.pdf. For the compressed/packaged files, we used Apache Tika for file identification and to attach a file extension.

We wrote code to change the file extension of the original files (via Apache Tika). We discovered a fairly large list of file types that need to be added to Tika (mostly specializations of .xml or .zip), and we created a deny list so that we would not overwrite the file extensions on that list.

For GitHub-based sites, we found that users frequently supplied a URL to an external file rather than, or in addition to, actually attaching a file. For those files, we tried to download the file from the external URL, and we used Tika to modify the file extension. These files are named, for example, MOZILLA-LINK-3185-0.pdf, which is the first valid external URL for mozilla issue 3185.

For all files, we changed the filesystem "last modified" date to the date the issue was opened, to give a sense of the age of the file. At some point we might change this to the "uploaded date" where available.

We removed 0-byte files, *.diff files, and *.txt files.

The source code for all crawlers except the pdfium crawler is available here:
https://github.com/tballison/tika-addons/tree/master/bugtracker-crawler

In November 2020, we refreshed the crawl for all sites except for pdfium.
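The last two post-processing steps described above (dropping 0-byte, *.diff, and *.txt files, and setting "last modified" to the date the issue was opened) could look roughly like the sketch below. This is an illustration in Python, not the crawler's actual code. The function names `should_drop` and `postprocess` are hypothetical, and matching the extensions case-insensitively is an assumption.

```python
import os
from datetime import datetime, timezone
from pathlib import Path

# Extensions removed from the corpus, per the description above.
# Case-insensitive matching is an assumption of this sketch.
DROPPED_SUFFIXES = {".diff", ".txt"}


def should_drop(path: Path) -> bool:
    """Return True for 0-byte files and for *.diff / *.txt files."""
    return path.stat().st_size == 0 or path.suffix.lower() in DROPPED_SUFFIXES


def postprocess(path: Path, issue_opened: datetime) -> bool:
    """Delete an unwanted file, or stamp a kept file with the issue date.

    Returns True if the file was kept, False if it was removed.
    """
    if should_drop(path):
        path.unlink()
        return False
    # os.utime takes (access time, modification time) in seconds since
    # the epoch; the access time is left as it was.
    os.utime(path, (path.stat().st_atime, issue_opened.timestamp()))
    return True
```

A call might look like `postprocess(Path("OOO-5860-10.zip-0.pdf"), datetime(2005, 1, 1, tzinfo=timezone.utc))`, where the date is a made-up placeholder for the issue's opened date as reported by the tracker.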
Sites
--

Bugzilla-based (using the standard REST-based API)
MOZILLA https://bugzilla.mozilla.org/
REDHAT https://bugzilla.redhat.com/
OOO https://bz.apache.org/ooo
POI https://bz.apache.org/bugzilla/
LIBRE_OFFICE https://bugs.documentfoundation.org/
GHOSTSCRIPT https://bugs.ghostscript.com/

Bugzilla-based HTML scraping (REST-based API is turned off) from https://bugs.freedesktop.org, products: cairo, colord, dejavu, poppler

GitLab-based
https://gitlab.freedesktop.org/poppler/poppler (stored in poppler-gitlab/)
https://gitlab.freedesktop.org/cairo/cairo (stored in cairo-gitlab/)
https://gitlab.gnome.org/GNOME/evince

GitHub-based
https://github.com/sumatrapdfreader/sumatrapdf
https://github.com/mozilla/pdf.js
https://github.com/qpdf/qpdf
https://github.com/LibrePDF/OpenPDF
https://github.com/jbarlow83/OCRmyPDF
https://github.com/barryvdh/laravel-snappy
https://github.com/pdfminer/pdfminer.six
https://github.com/diegomura/react-pdf
https://github.com/foliojs/pdfkit
https://github.com/barteksc/AndroidPdfViewer
https://github.com/tabulapdf/tabula
https://github.com/tabulapdf/tabula-java
https://github.com/libvips/libvips
https://github.com/prawnpdf/prawn
https://github.com/axa-group/Parsr
https://github.com/pdfcpu/pdfcpu
https://github.com/pikepdf/pikepdf

On 8-9 March 2021, we also crawled JPEG sites, including:
https://github.com/libjpeg-turbo/libjpeg-turbo
https://github.com/haraldk/TwelveMonkeys
https://github.com/google/guetzli
https://github.com/mozilla/mozjpeg
https://github.com/tjko/jpegoptim
https://github.com/lovell/sharp
https://github.com/libvips/libvips
https://github.com/dropbox/lepton
https://github.com/SixLabors/ImageSharp
https://github.com/drewnoakes/metadata-extractor
https://github.com/contentful-labs/Concorde
https://github.com/spatie/image-optimizer
https://github.com/danielgtaylor/jpeg-archive

JIRA-based
https://issues.apache.org/jira/projects/COMPRESS
https://issues.apache.org/jira/projects/FOP
https://issues.apache.org/jira/projects/PDFBOX
https://issues.apache.org/jira/projects/TIKA
https://issues.apache.org/jira/projects/NUTCH
https://ec.europa.eu/cefdigital/tracker/projects/DSS

Other
pdfium https://bugs.chromium.org/p/pdfium/issues/list

Known issues
--
* We deleted REDHAT-894449-14.gz, which included a 78 GB zip bomb
* These files come straight from the internet. We've identified a handful of malicious documents, but there may be more. Let us know what you find!
* Tika's file type detection is imperfect
* Attachments can be duplicated when the same attachment is linked under different URLs within an issue (no obvious solution) or across issues
* There can be duplicates in external links per issue (now fixed) and across issues
* Still to do:
** Red Hat is Bugzilla-based but overwhelming; we need more precise queries for the file types of interest
** GitLab: poppler
** Other OSS: PoDoFo, ?

Future work
--
* Need to modify code for incremental updates
* Figure out how to balance files better into subdirectories

---

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0079. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). Approved for public release.