[HIRING] Python Developer for WSJ PDF Article Extraction Project

Published on 06/03/2025 · Hiring & Talent Acquisition Insights

This is a direct hiring post for a Python developer. The task involves building a script to process Wall Street Journal PDF editions, extracting and stitching together full articles and graphs, and then outputting them into a new, clean, and structured PDF.
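The "stitching" step described above can be sketched roughly as follows. This is a minimal illustration, not the poster's method: the `Fragment` and `Article` shapes, the `article_key` label, and the sample text are all hypothetical, and the sketch assumes an earlier layout-analysis step has already tagged each piece of text with the article it belongs to.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Fragment:
    """One piece of an article found on a single page (hypothetical shape)."""
    article_key: str  # assumed label linking all pieces of one article
    page: int         # 1-based page number where the piece was found
    text: str


@dataclass
class Article:
    """A full article reassembled from its fragments (hypothetical shape)."""
    key: str
    pages: List[int] = field(default_factory=list)
    text: str = ""


def stitch(fragments: List[Fragment]) -> List[Article]:
    """Group fragments by article and join them in page order."""
    grouped: Dict[str, List[Fragment]] = defaultdict(list)
    for frag in fragments:
        grouped[frag.article_key].append(frag)

    articles = []
    for key, parts in grouped.items():
        parts.sort(key=lambda f: f.page)
        articles.append(
            Article(
                key=key,
                pages=[f.page for f in parts],
                text="\n\n".join(f.text for f in parts),
            )
        )
    # Order the output by the page each article starts on, then by key.
    articles.sort(key=lambda a: (a.pages[0], a.key))
    return articles


# Made-up sample data: one article jumps from page 1 to page 6.
fragments = [
    Fragment("rates", 6, "Continued text."),
    Fragment("chips", 1, "Chip story."),
    Fragment("rates", 1, "Opening paragraphs."),
]
articles = stitch(fragments)
for article in articles:
    print(article.key, article.pages)
```

The hard part in practice is producing a reliable `article_key` from the layout (for example, by matching continuation lines to headlines); the grouping itself is simple once that label exists.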

Key Skills, Tools, and Qualifications:

  • Programming Language: Python (as specified).
  • PDF Manipulation Libraries: Experience with libraries such as PyPDF2 (now maintained as pypdf), pdfminer.six (for text and layout analysis), ReportLab (for generating the new PDF), and PyMuPDF (imported as fitz) is essential.
  • Text Extraction: Ability to accurately extract text content from PDFs, including handling multi-column layouts and articles that span multiple pages.
  • Layout Analysis: Ability to develop logic that identifies article boundaries and distinguishes articles from advertisements and other content. This also includes identifying and extracting graphs.
  • Data Structuring: Skills to organize the extracted content (articles, graphs) into a structured format before generating the new PDF.
  • OCR (Optical Character Recognition): May be needed if some PDF content is image-based or otherwise not directly extractable, for example Tesseract with a Python wrapper such as pytesseract.
  • Problem-Solving: Newspaper layouts can be complex and inconsistent; strong problem-solving skills are essential.
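As one illustration of the multi-column handling mentioned in the list above, here is a minimal sketch that puts text blocks into reading order. It assumes each block is an `(x0, y0, x1, y1, text)` tuple with the origin at the top-left of the page (similar in spirit to the block tuples PyMuPDF can return), and that columns are equal-width. Both are simplifying assumptions; real newspaper pages have headlines and graphics that span columns, which this does not handle.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text): a hypothetical block shape used only in this sketch.
Block = Tuple[float, float, float, float, str]


def order_blocks_by_column(
    blocks: List[Block], page_width: float, n_columns: int
) -> List[Block]:
    """Return blocks in reading order: leftmost column first, top to bottom."""
    column_width = page_width / n_columns

    def sort_key(block: Block) -> Tuple[int, float]:
        x0, y0, x1, _y1, _text = block
        center_x = (x0 + x1) / 2
        # Clamp so a block touching the right edge stays in the last column.
        column = min(int(center_x // column_width), n_columns - 1)
        return (column, y0)

    return sorted(blocks, key=sort_key)


# Made-up blocks on a 600-unit-wide, two-column page.
blocks = [
    (310.0, 50.0, 590.0, 120.0, "right column, top"),
    (10.0, 200.0, 290.0, 260.0, "left column, bottom"),
    (10.0, 50.0, 290.0, 120.0, "left column, top"),
    (310.0, 200.0, 590.0, 260.0, "right column, bottom"),
]
ordered = order_blocks_by_column(blocks, page_width=600.0, n_columns=2)
print([b[4] for b in ordered])
```

Sorting on a `(column, y0)` key keeps the logic easy to test; a production script would likely need to detect column boundaries from the page itself rather than assume a fixed count.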

Job Opportunity:

  • Role: Python Developer (Freelance/Contract).
  • Project: Develop a specialized script for automated PDF content extraction and re-compilation from Wall Street Journal editions.

Resume Submission Directions:

  • Directly contact the original poster (OP) of the Reddit thread.
  • Provide a resume and, more importantly, a portfolio or examples of previous relevant work.
  • Highlight specific experience with Python, PDF processing, text extraction, data parsing, and any projects involving complex document analysis or manipulation. Mention familiarity with the suggested libraries if applicable.

Expected Benefits:

  • Financial Compensation: This is a paid hiring opportunity (likely project-based or hourly rate for a contract).
  • Portfolio Piece: A successful project would serve as a strong, specific example of advanced PDF manipulation and data extraction skills.
  • Skill Enhancement: Opportunity to tackle a challenging real-world data extraction problem, potentially deepening expertise in Python and PDF processing.
  • Direct Impact: Creating a tool that directly solves a specific problem for the client.

Original Reddit Post

r/forhire

[Hiring] Python Developer to Extract and Stitch Full Articles from WSJ PDF Layouts

Posted by u/vernacular-ai on 06/03/2025
I’m looking to hire a Python developer to help build a script that processes *Wall Street Journal* (WSJ) PDF editions and outputs a clean, structured PDF with only articles and graphs. I have

Top Comments

u/GlitteringStomach448
Working on it right now!
