[HIRING] Python Developer for WSJ PDF Article Extraction Project
This is a direct hiring post for a Python developer. The task involves building a script to process Wall Street Journal PDF editions, extracting and stitching together full articles and graphs, and then outputting them into a new, clean, and structured PDF.
Key Skills, Tools, and Qualifications:
- Programming Language: Python (as specified).
- PDF Manipulation Libraries: Experience with libraries like
PyPDF2
,pdfminer.six
(for text and layout analysis),ReportLab
(for generating the new PDF), andfitz
(PyMuPDF) is essential. - Text Extraction: Ability to accurately extract text content from PDFs, including handling multi-column layouts and articles that span multiple pages.
- Layout Analysis: Understanding or ability to develop logic to identify article boundaries, differentiate between articles, advertisements, and other content. This also includes identifying and extracting graphs.
- Data Structuring: Skills to organize the extracted content (articles, graphs) into a structured format before generating the new PDF.
- OCR (Optical Character Recognition): Potentially, if some PDF content is image-based or not easily extractable, OCR tools (e.g., Tesseract with a Python wrapper like
pytesseract
) might be needed. - Problem-Solving: Newspaper layouts can be complex and inconsistent; strong problem-solving skills are essential.
Job Opportunity:
- Role: Python Developer (Freelance/Contract).
- Project: Develop a specialized script for automated PDF content extraction and re-compilation from Wall Street Journal editions.
Resume Submission Directions:
- Directly contact the original poster (OP) of the Reddit thread.
- Provide a resume and, more importantly, a portfolio or examples of previous relevant work.
- Highlight specific experience with Python, PDF processing, text extraction, data parsing, and any projects involving complex document analysis or manipulation. Mention familiarity with the suggested libraries if applicable.
Expected Benefits:
- Financial Compensation: This is a paid hiring opportunity (likely project-based or hourly rate for a contract).
- Portfolio Piece: A successful project would serve as a strong, specific example of advanced PDF manipulation and data extraction skills.
- Skill Enhancement: Opportunity to tackle a challenging real-world data extraction problem, potentially deepening expertise in Python and PDF processing.
- Direct Impact: Creating a tool that directly solves a specific problem for the client.
Origin Reddit Post
r/forhire
[Hiring] Python Developer to Extract and Stitch Full Articles from WSJ PDF Layouts
Posted by u/vernacular-ai•06/03/2025
I’m looking to hire a Python developer to help build a script that processes *Wall Street Journal* (WSJ) PDF editions and outputs a clean, structured PDF with only articles and graphs. I have
Top Comments
u/GlitteringStomach448
Working on it right now !