Pull data from external sources (websites, PDFs, SEC filings, and U.S. Census geographies) without building scrapers from scratch.
Functions
FetchWebsiteText
Scrape and clean the text content of any public webpage.
from analysistoolbox.data_collection import FetchWebsiteText
text = FetchWebsiteText(url="https://example.com/article")
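To give a feel for what "scrape and clean" can involve, here is a minimal standard-library sketch that strips markup, scripts, and styles from an HTML string. It is not the library's implementation; `VisibleTextParser` and `html_to_text` are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside script/style/noscript."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        # Join the chunks and collapse any remaining runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as a single string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```

In practice the HTML string would come from an HTTP response body; a production scraper would also need to handle encodings, redirects, and pages rendered by JavaScript.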
ExtractTextFromPDF
Extract clean text from a local or remote PDF document.
from analysistoolbox.data_collection import ExtractTextFromPDF
text = ExtractTextFromPDF(file_path="report.pdf")
FetchPDFFromURL
Download a PDF from a URL to a local path.
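The exact signature of FetchPDFFromURL is not shown here, so the following is only a standard-library sketch of the same idea: stream a remote file to disk and confirm that it starts with the PDF magic bytes. The function name `download_pdf` and its parameters are assumptions, not part of the library.

```python
import shutil
import urllib.request
from pathlib import Path


def download_pdf(url: str, destination: str) -> Path:
    """Save the resource at `url` to `destination` and verify it is a PDF."""
    target = Path(destination)
    target.parent.mkdir(parents=True, exist_ok=True)

    # Copy the response in chunks so large reports are not held in memory.
    with urllib.request.urlopen(url, timeout=30) as response, target.open("wb") as out_file:
        shutil.copyfileobj(response, out_file)

    # PDF files begin with the bytes "%PDF-"; discard anything else.
    with target.open("rb") as saved:
        header = saved.read(5)
    if header != b"%PDF-":
        target.unlink()
        raise ValueError(f"{url} did not return a PDF document")
    return target
```

The header check catches a common failure mode where a server answers with an HTML error or login page instead of the document.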
GetCompanyFilings
Access SEC EDGAR filings programmatically, including 10-Ks, 10-Qs, and 8-Ks.
from analysistoolbox.data_collection import GetCompanyFilings
filings = GetCompanyFilings(
    company_name="Apple Inc.",
    filing_type="10-K"
)
GetGoogleSearchResults
Fetch Google search results via the Serper API. Requires a SERPER_API_KEY environment variable.
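Since the parameters of GetGoogleSearchResults are not listed here, the sketch below calls the Serper API directly rather than guessing at the wrapper. It assumes the endpoint `https://google.serper.dev/search`, an `X-API-KEY` header, a JSON body with a `q` field, and an `organic` list in the response; check these against the Serper documentation before relying on them. The helper names are hypothetical.

```python
import json
import os
import urllib.request

# Assumed Serper search endpoint; verify against the Serper documentation.
SERPER_SEARCH_URL = "https://google.serper.dev/search"


def build_serper_request(query: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Serper search."""
    api_key = os.environ.get("SERPER_API_KEY")
    if not api_key:
        raise RuntimeError("Set the SERPER_API_KEY environment variable first")

    payload = json.dumps({"q": query}).encode("utf-8")
    return urllib.request.Request(
        SERPER_SEARCH_URL,
        data=payload,
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def search(query: str) -> list:
    """Send the request and return the organic results (assumed response key)."""
    request = build_serper_request(query)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    return body.get("organic", [])
```

Reading the key from the environment keeps it out of source control, which matches the SERPER_API_KEY requirement above.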
FetchUSShapefile
Retrieve U.S. Census TIGER shapefiles for states, counties, tracts, or congressional districts.
from analysistoolbox.data_collection import FetchUSShapefile
gdf = FetchUSShapefile(geography="county", state="Virginia")
GetZipFile
Download and extract a ZIP archive from a URL.
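The call signature of GetZipFile is not documented here, so the example below is a standard-library sketch of the same download-and-extract pattern, not the library's code. `download_and_extract_zip` and its parameters are hypothetical.

```python
import io
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract_zip(url: str, extract_to: str) -> list:
    """Download the archive at `url`, unpack it, and return the member paths."""
    target_dir = Path(extract_to).resolve()
    target_dir.mkdir(parents=True, exist_ok=True)

    # Read the whole archive into memory; fine for modest files,
    # but very large archives should be streamed to disk first.
    with urllib.request.urlopen(url, timeout=60) as response:
        archive_bytes = response.read()

    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        archive.extractall(target_dir)
        names = archive.namelist()

    return [target_dir / name for name in names]
```

Returning the extracted paths makes it easy to hand specific members, such as a CSV or a shapefile, to the next step of a pipeline.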
Use cases
- Competitive intelligence from public web sources
- Financial analysis from SEC filings
- Document processing pipelines
- Geospatial analysis with Census boundaries