Investigative Journalism Online Tools Directory

Purpose: Comprehensive catalog of features, functions, and tools used by investigative journalists and OSINT practitioners — organized as input for the Open Semantic Search major upgrade development plan. Part I covers investigative journalism platforms and workflows. Part II covers the broader Open Source Intelligence (OSINT) resource ecosystem. Part III provides analysis and upgrade planning.

Date: April 2026

Part I — Investigative Journalism Platforms & Workflows


1. Document Ingestion & Processing

The foundation of every investigative project is getting documents into a searchable, analyzable state.

Core Functions

Function | Description | Tools
Multi-format ingestion | Accept PDFs, Word, Excel, email (.eml/.pst), images, HTML, archives (.zip/.tar), audio, video | Apache Tika, Datashare, Aleph, DocumentCloud
OCR (optical character recognition) | Extract text from scanned documents, images, embedded images in PDFs | Tesseract, Google Cloud Vision, Amazon Textract, DocumentCloud AI OCR
Handwriting recognition | Transcribe handwritten notes and annotations | Google Pinpoint, Azure AI
Audio/video transcription | Convert speech to searchable text | Google Pinpoint (15 languages, up to 2hr files), Whisper, AssemblyAI
Email parsing | Extract headers, body, attachments, thread structure from email archives | Datashare, Aleph, Open Semantic ETL
Table extraction | Convert scanned/PDF tables into structured spreadsheets | Google Pinpoint, Camelot, Tabula, Arkham Mirror (vision-based)
Metadata extraction | Pull EXIF, document properties, author info, creation dates, GPS coordinates | ExifTool, FOCA, Metagoofil, Apache Tika
Batch processing | Ingest thousands/millions of documents with queue management | Datashare (CLI mode), Aleph, Open Semantic ETL (Celery/RabbitMQ)
Incremental crawling | Re-scan sources and ingest only new/changed documents | Open Semantic Search (cron), Aleph crawlers
Archive decompression | Recursively unpack nested archives (.zip, .tar.gz, .rar, .7z) | Apache Tika, Aleph

Reference Implementations

  • ICIJ Datashare: Tika + Tesseract + CoreNLP pipeline; local-first; CLI mode for large-scale batch processing
  • Open Semantic Search: ETL framework with Celery task queue, RabbitMQ broker, Tika server, Tesseract cache
  • OCCRP Aleph: Ingestors for structured (CSV, SQL) and unstructured (documents) data; FtM entity mapping
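
Incremental crawling, as listed in the table above, comes down to remembering what has already been ingested. The sketch below is one minimal way to do that, assuming a content hash per file path is a good enough change signal; the function names and the in-memory `seen` dictionary are illustrative and are not taken from Open Semantic Search or Aleph.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(root: Path, seen: dict[str, str]) -> list[Path]:
    """Return files under root that are new or modified since the last scan.

    `seen` maps file path -> last known digest and is updated in place,
    so a second call with unchanged files returns an empty list.
    """
    to_ingest = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = file_digest(path)
        key = str(path)
        if seen.get(key) != digest:
            to_ingest.append(path)
            seen[key] = digest
    return to_ingest
```

In a real pipeline the `seen` mapping would be persisted between runs (for example in the search index itself), and each returned path would be handed to the ingestion queue.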

2. Search & Discovery

How journalists find needles in document haystacks.

Core Functions

Function | Description | Tools
Full-text search | Keyword search across all indexed content | Solr, Elasticsearch, Aleph, Datashare
Boolean operators | AND, OR, NOT, grouping with parentheses | All major search platforms
Wildcard / fuzzy search | Pattern matching, spelling tolerance | Solr (edit distance), Elasticsearch
Proximity search | Find terms within N words of each other | Solr, Aleph
Phrase search | Exact phrase matching with quotes | All major platforms
Faceted search / interactive filters | Navigate by author, date, entity, file type, language, tag | Open Semantic Search, Datashare, Aleph
Semantic search | Meaning-based search beyond keyword matching | Trove (RAG), Arkham Mirror (embeddings), IntellyWeave
Natural language queries | Ask questions in plain English, get cited answers | Trove, Google Pinpoint, Presswork.ai
Search-by-list | Upload a list of names/terms, find all matching documents | Open Semantic Search, Datashare (batch search API)
Date range filtering | Restrict results to specific time periods | All major platforms
Saved searches / alerts | Save queries and get notified of new matches | Aleph, Datashare
Search within results | Progressively narrow result sets | Open Semantic Search (facet drilldown)
Cross-collection search | Search across multiple document collections simultaneously | Aleph (multi-dataset), Datashare (server mode)
Relevance ranking | Score and sort results by relevance, with tuning controls | Solr (boosts), Elasticsearch, Open Semantic Search
Document preview / snippets | Show matching text in context without opening full document | All major platforms
Similar document discovery | “More like this”: find related documents | Solr MLT, Elasticsearch
Synonym / thesaurus expansion | Automatically include synonyms and related terms | Open Semantic Search (SKOS thesaurus)
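
The Boolean, phrase, proximity, and fuzzy functions in the table above correspond to operators in the Lucene classic query syntax, which Solr's standard query parser accepts. The sketch below only builds query strings; the helper names are hypothetical, and how a given platform exposes the query box or API is outside its scope.

```python
def phrase(text: str) -> str:
    """Quote text for exact phrase matching, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def proximity(text: str, slop: int) -> str:
    """Match the words of `text` within `slop` positions of each other."""
    return f"{phrase(text)}~{slop}"


def fuzzy(term: str, max_edits: int = 2) -> str:
    """Match single terms within an edit distance of 0, 1, or 2."""
    if max_edits not in (0, 1, 2):
        raise ValueError("Lucene fuzzy queries allow at most 2 edits")
    return f"{term}~{max_edits}"


def any_of(clauses) -> str:
    """Group clauses with OR."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(clauses) -> str:
    """Group clauses with AND."""
    return "(" + " AND ".join(clauses) + ")"


def search_by_list(names) -> str:
    """Turn an uploaded list of names into one OR query of exact phrases."""
    return any_of(phrase(n) for n in names)
```

For example, `all_of([search_by_list(["John Doe", "Acme Ltd"]), proximity("wire transfer", 5)])` yields a single query that requires either name plus the two words "wire" and "transfer" close together.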

3. Entity Extraction & NLP

Automatically identifying people, organizations, places, and relationships within documents.

Core Functions

Function | Description | Tools
Named Entity Recognition (NER) | Extract persons, organizations, locations | spaCy, CoreNLP, GLiNER, Stanza
Email address extraction | Find and catalog email addresses in text | Datashare, theHarvester
Phone number extraction | Identify phone numbers across formats | Custom regex, NER models
Financial entity extraction | Identify monetary amounts, account numbers, transactions | Trove, Arkham Mirror
Date/time extraction | Normalize dates across formats and languages | SUTime (CoreNLP), Duckling
Address extraction | Parse physical addresses, normalize formatting | Libpostal, custom NER
Custom entity types | Define domain-specific entities (laws, case numbers, cryptonyms) | IntellyWeave (GLiNER), spaCy custom models
Entity linking / disambiguation | Connect extracted entities to known databases (Wikidata, etc.) | Open Semantic Entity Search API, spaCy entity linker
Coreference resolution | Connect “he,” “the company,” “the defendant” to specific entities | CoreNLP, NeuralCoref
Multilingual NER | Entity extraction across 40+ languages | spaCy multilingual models, New/s/leak 2.0, Stanza
Sentiment / tone analysis | Detect emotional tone in communications | Talkwalker, custom models
Topic modeling | Auto-discover themes across document collections | LDA, BERTopic
Language detection | Automatically identify document language | langdetect, fastText, Tika
Keyword / keyphrase extraction | Identify most significant terms per document | YAKE, KeyBERT, TF-IDF
Manual entity annotation | Human correction and addition of entities | Open Semantic Search (tagger), Datashare (star/tag), Aleph
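
The "custom regex" approach named above for email and phone extraction can be sketched with the standard library alone. The patterns here are deliberately simple assumptions, not the ones any listed tool uses; they will miss unusual formats and can produce false positives on long digit runs such as account numbers.

```python
import re

# Simplified patterns; real-world extraction usually needs stricter validation.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d")


def extract_emails(text: str) -> list[str]:
    """Return unique email addresses, lowercased and sorted."""
    return sorted({m.lower() for m in EMAIL_RE.findall(text)})


def extract_phones(text: str) -> list[str]:
    """Return unique phone-like strings reduced to digits (with a leading + if present)."""
    found = []
    for match in PHONE_RE.findall(text):
        digits = re.sub(r"\D", "", match)
        # Keep only plausible lengths; 15 digits is the E.164 maximum.
        if 8 <= len(digits) <= 15:
            normalized = ("+" if match.startswith("+") else "") + digits
            if normalized not in found:
                found.append(normalized)
    return found
```

Reducing each match to digits makes it easier to compare numbers written in different formats across documents.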

4. Network Analysis & Link Visualization

Revealing hidden relationships between people, companies, and events.

Core Functions

Function | Description | Tools
Entity relationship mapping | Visualize connections between people, orgs, accounts | Maltego, i2 Analyst’s Notebook, Aleph
Network graph visualization | Interactive node-link diagrams | Neo4j Browser, Cytoscape.js, Gephi, Sigma.js
Link analysis | Identify shortest paths, clusters, central nodes | Maltego, i2, Gephi
Timeline visualization | Plot events chronologically to reveal patterns | Aleph, Trove, TimelineJS, i2
Geographic mapping of networks | Overlay network data on maps | Maltego, IntellyWeave (Mapbox), Aleph
Automated connection discovery | AI-driven detection of non-obvious links | Maltego transforms, Trove
Cluster detection | Identify groups/communities within networks | Gephi (modularity), Neo4j (community detection)
Influence / centrality scoring | Rank nodes by importance (betweenness, PageRank) | Gephi, Neo4j, NetworkX
Temporal network analysis | Track how networks evolve over time | Gephi (dynamic), custom tools
Cross-dataset entity matching | Match entities across different databases | Aleph (cross-referencing), OpenRefine (reconciliation)
Export / publish graphs | Share visualizations for publication or collaboration | Maltego, Gephi (SVG/PDF), Aleph (embed)

Reference Implementations

  • Maltego: 200+ data source transforms; 200M+ company records; 1B+ online identities
  • i2 Analyst’s Notebook: 30+ year industry standard; drag-and-drop; team collaboration
  • OCCRP Aleph: FollowTheMoney (FtM) data model; YAML entity mapping; network diagrams built into platform
  • Open Semantic Search: Neo4j graph database integration; Cytoscape.js visualization
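
Link analysis functions such as shortest paths and centrality scoring can be illustrated without any graph library. The sketch below uses degree centrality (the simplest measure, swapped in here for the betweenness and PageRank measures named in the table) and breadth-first search on an undirected graph; the function names and sample entities are made up for illustration.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def shortest_path(graph, start, goal):
    """Return one shortest path as a list of nodes, or None if unreachable."""
    if start == goal:
        return [start]
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr not in parents:
                parents[nbr] = node
                if nbr == goal:
                    path = [nbr]
                    while parents[path[-1]] is not None:
                        path.append(parents[path[-1]])
                    return path[::-1]
                queue.append(nbr)
    return None
```

A path such as person, company, person, company, person is the kind of chain of shared directorships that the visual tools above draw as a node-link diagram.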

5. Knowledge Graphs & Structured Data

Organizing investigative knowledge into queryable, connected structures.

Core Functions

Function | Description | Tools
Ontology / thesaurus management | Define and maintain controlled vocabularies | Open Semantic Search (SKOS/RDF), Protégé
RDF / linked data | Represent knowledge as machine-readable triples | Open Semantic Search, Apache Jena
Graph database storage | Store entities and relationships as nodes/edges | Neo4j, ArangoDB, JanusGraph
SPARQL / Cypher queries | Query knowledge graphs with graph query languages | Neo4j (Cypher), Apache Jena (SPARQL)
FollowTheMoney (FtM) | Standardized entity model for investigative data | Aleph, OpenSanctions
Schema.org structured data | Standard vocabularies for web-published findings | Schema.org, JSON-LD
Entity deduplication | Merge duplicate entity records | Aleph, OpenRefine, Dedupe.io
Taxonomy tagging | Classify documents by topic/category hierarchies | Open Semantic Search, custom taxonomies
Inference / reasoning | Derive new facts from existing knowledge | OWL reasoners, Neo4j GDS
Knowledge graph visualization | Interactive exploration of entity networks | Neo4j Browser, Open Semantic Search (graph explorer)

6. Corporate & Financial Intelligence

Following the money and mapping corporate structures.

Core Functions

Function | Description | Tools
Company registry search | Look up company records across jurisdictions | OpenCorporates (140+ registries), Companies House
Beneficial ownership lookup | Identify ultimate owners behind corporate veils | OpenSanctions, ICIJ Offshore Leaks, national BO registries
Sanctions screening | Check entities against global sanctions lists | OpenSanctions (2.1M+ entities, 328 sources), OFAC, EU consolidated list
PEP (Politically Exposed Person) screening | Identify politically connected individuals | OpenSanctions PEP data, Dow Jones, World-Check
Corporate hierarchy mapping | Visualize parent-subsidiary-affiliate structures | OpenCorporates, Aleph, Maltego
Director / officer cross-referencing | Find shared directorships across companies | OpenCorporates, Maltego
Financial transaction tracing | Follow money flows across accounts and entities | Trove (financial extraction), Chainalysis (crypto)
Property / land records | Search real estate ownership databases | County recorder databases, Zillow, Regrid
Court records / litigation | Search case filings and court documents | PACER, CourtListener, RECAP
Offshore leak databases | Search Panama Papers, Paradise Papers, Pandora Papers | ICIJ Offshore Leaks Database
Lobbyist / political donation databases | Track political influence and money | OpenSecrets, FEC, state lobbying databases
Tax haven / jurisdiction analysis | Identify structures in low-transparency jurisdictions | Tax Justice Network, ICIJ resources

Reference Implementations

  • OpenSanctions: Free for investigative use; API for bulk matching; reconciliation endpoints
  • OCCRP Aleph: Pre-loaded with sanctions, corporate registries, leak databases
  • ICIJ Offshore Leaks: 810K+ offshore entities searchable online
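
Sanctions and PEP screening ultimately rests on matching a name against a watchlist despite differences in spelling, accents, and word order. The sketch below shows one simple local approach using only the standard library; it is not the matching logic of OpenSanctions or any other listed service, and the 0.85 threshold is an arbitrary assumption that would need tuning against real data.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and sort the name's tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents.lower())
    return " ".join(sorted(cleaned.split()))


def screen(name: str, watchlist, threshold: float = 0.85):
    """Return (watchlist entry, similarity) pairs at or above the threshold."""
    target = normalize_name(name)
    hits = []
    for entry in watchlist:
        score = SequenceMatcher(None, target, normalize_name(entry)).ratio()
        if score >= threshold:
            hits.append((entry, round(score, 2)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

A hit from this kind of matcher is a lead to verify against identifiers such as birth dates or registration numbers, not a confirmed identification.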

7. Geolocation & Mapping

Placing events, people, and evidence in geographic context.

Core Functions

Function | Description | Tools
Satellite imagery analysis | Examine high-res imagery for physical evidence | Google Earth Pro, Planet, Maxar, Airbus (via Apollo Mapping)
Historical satellite comparison | Compare imagery over time for changes | Google Earth (timelapse), Sentinel Hub
Street-level verification | Verify locations using panoramic street imagery | Google Street View, Mapillary, KartaView
Sun/shadow analysis | Determine time/date from shadow positions | SunCalc, ShadowMap
Geolocation from photos | Identify where a photo was taken from visual clues | GeoHints, Google Lens, Bellingcat geolocation tools
WiFi network geolocation | Map WiFi access point locations | WiGLE
Social media geolocation | Discover geotagged posts from specific areas | Social Geo Lens, Creepy
Custom map creation | Plot investigation data on interactive maps | Mapbox, Leaflet, Google My Maps, QGIS
EXIF GPS extraction | Pull GPS coordinates from photo metadata | ExifTool, Jeffrey’s EXIF Viewer
Geofencing / area monitoring | Track activity within defined geographic areas | Custom tools, social media monitoring
3D terrain analysis | Analyze topography for line-of-sight, elevation | Google Earth Pro, Cesium
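
EXIF GPS data is stored as degrees, minutes, and seconds plus a hemisphere reference, while mapping tools generally expect signed decimal degrees. The conversion is degrees + minutes/60 + seconds/3600, negated for south and west. A minimal sketch, with a hypothetical function name:

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, ref: str) -> float:
    """Convert degrees/minutes/seconds plus N/S/E/W reference to signed decimal degrees."""
    ref = ref.upper()
    if ref not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere reference: {ref!r}")
    value = degrees + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are negative by convention.
    return -value if ref in ("S", "W") else value
```

For example, 48° 51' 29.6" N works out to roughly 48.8582 and 2° 17' 40.2" E to roughly 2.2945, a pair that can be pasted directly into most map search boxes.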

8. Image & Video Verification

Detecting manipulated, AI-generated, or misattributed visual media.

Core Functions

Function | Description | Tools
Reverse image search | Find original source and prior uses of an image | Google Lens, TinEye, Yandex Images, Bing Visual Search
Video verification | Analyze video authenticity, extract keyframes | InVID Verification Plugin, YouTube DataViewer (Amnesty)
Image forensics | Detect cloning, splicing, JPEG compression artifacts | Forensically, FotoForensics, Error Level Analysis
AI-generated image detection | Identify GAN/diffusion-generated fake images | AmIReal, Hive AI Detection, Illuminarty
Deepfake detection | Identify AI-manipulated video/audio | Azure AI Video Indexer, Sensity, Reality Defender
Metadata analysis | Examine EXIF, IPTC, XMP data for authenticity clues | ExifTool, Jeffrey’s EXIF Viewer
Chronolocation | Determine when a photo/video was taken | SunCalc (shadow analysis), weather correlation
Logo / object identification | Identify organizations, equipment, weapons in images | Google Lens, CamFind, military identification databases
Facial detection | Detect faces in video for indexing (not biometric ID) | Azure AI Video Indexer

9. Social Media Intelligence

Monitoring, collecting, and analyzing social media for investigative leads.

Core Functions

Function | Description | Tools
Platform monitoring | Track mentions, hashtags, accounts across platforms | Talkwalker, CrowdTangle, Meltwater
Historical data retrieval | Access deleted/archived social media posts | Wayback Machine, Archive.today, cached versions
Twitter/X analysis | Historical tweet extraction, network mapping | Various OSINT tools (post-API restrictions)
Telegram channel search | Index and search Telegram channels/groups | Telegago, TGStat
Facebook/Instagram research | Profile analysis, group monitoring | CrowdTangle (Meta), Who Posted What
Google account profiling | Discover accounts linked to a Google account | GHunt
Username enumeration | Find accounts across platforms by username | Sherlock, Namechk, KnowEm
Profile archiving | Save social media profiles before deletion | Hunchly, Auto Archiver (Bellingcat)
Sentiment / narrative tracking | Track how narratives spread across platforms | Talkwalker, Brandwatch
Bot detection | Identify automated/inauthentic accounts | Botometer, custom analysis
Influence network mapping | Map follower/following networks and amplification | Maltego, Gephi, custom scrapers

10. Web Archiving & Evidence Preservation

Capturing and preserving digital evidence with chain of custody.

Core Functions

Function | Description | Tools
Webpage snapshots | Save point-in-time copies of web pages | Archive.today, Wayback Machine
Automatic capture | Record every page visited during investigation | Hunchly
Batch archiving | Archive multiple URLs programmatically | Bellingcat Auto Archiver, ArchiveBox
Hash verification | Cryptographic hash of captured content for integrity | Hunchly (SHA-256), Auto Archiver
Timestamp certification | Provable timestamp of when content was captured | Hunchly, blockchain timestamping
Screenshot capture | Visual preservation of page appearance | Archive.today (PNG), Hunchly (full-page)
Social media archiving | Capture posts, profiles, comments before deletion | Auto Archiver, Hunchly, Archive.today
PDF generation | Convert web pages to archival PDF format | SingleFile, print-to-PDF, custom tools
Perceptual hashing | Detect near-duplicate content across archives | Auto Archiver
Chain of custody documentation | Audit trail of who captured what and when | Hunchly, Auto Archiver
Evidence packaging | Generate court-ready or publication-ready evidence bundles | Hunchly

Reference Implementations

  • Hunchly: Commercial; automatic capture, tagging, audit trails, court-ready exports
  • Bellingcat Auto Archiver: Open-source; 150K+ pages preserved; batch processing; perceptual hashing
  • Archive.today: Free; preserves JS-rendered content; persistent URLs
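
Hash verification and chain-of-custody logging can be sketched as a small capture record: hash the captured bytes, note who captured them and when, and recompute the hash later to confirm nothing changed. The record layout and function names below are assumptions for illustration and do not reflect the format used by Hunchly or Auto Archiver. The timestamp is self-asserted, so on its own it is not the provable timestamp certification listed in the table.

```python
import hashlib
from datetime import datetime, timezone


def capture_record(url: str, content: bytes, captured_by: str) -> dict:
    """Describe a captured page: its SHA-256 digest, size, capture time, and collector."""
    return {
        "url": url,
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "captured_by": captured_by,
    }


def verify(record: dict, content: bytes) -> bool:
    """Return True if the content still matches the digest stored at capture time."""
    return hashlib.sha256(content).hexdigest() == record["sha256"]
```

Storing such records separately from the captured files, for instance as JSON lines in an append-only log, makes later tampering with either one easier to detect.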

11. FOIA & Public Records

Filing, tracking, and analyzing government records requests.

Core Functions

Function | Description | Tools
FOIA request filing | Submit public records requests to government agencies | MuckRock (23K+ agencies), iFOIA
Request tracking | Monitor status of pending requests | MuckRock, custom tracking
Agency database | Directory of government agencies and their FOIA contacts | MuckRock, FOIA.gov
FOIA log search | Search existing FOIA logs for prior requests | MuckRock FOIA Log Explorer (170K+ requests)
Response management | Track incoming documents and fee payments | MuckRock
Appeal templates | Generate appeals for denied or inadequate responses | MuckRock, RCFP resources
Document hosting | Publish received documents publicly | DocumentCloud (6.9M+ public documents)
Document annotation | Annotate key passages in public records | DocumentCloud, Datashare
Bulk embedding | Embed documents in news articles | DocumentCloud embed API

12. Secure Communication & Whistleblower Intake

Protecting sources and handling sensitive materials.

Core Functions

Function | Description | Tools
Anonymous document submission | Allow sources to upload files anonymously | SecureDrop, GlobaLeaks
End-to-end encryption | Encrypt communications between source and journalist | Signal, SecureDrop (GPG)
Tor-based anonymity | Route communications through onion network | SecureDrop (.onion), Tor Browser
Air-gapped viewing | View sensitive documents on non-networked machines | SecureDrop Secure Viewing Station
Metadata stripping | Remove identifying metadata from documents | MAT2, ExifTool, SecureDrop
Encrypted storage | Store documents with at-rest encryption | VeraCrypt, LUKS, SecureDrop
Source communication | Two-way messaging with anonymous sources | SecureDrop, Signal
Organization-owned infrastructure | No third-party servers; full organizational control | SecureDrop (on-premises)

13. Collaboration, Annotation & Redaction

Working in teams on sensitive investigations.

Core Functions

Function | Description | Tools
Shared investigation workspaces | Private team spaces for investigative projects | Aleph (investigations), DocumentCloud, Trove
Document annotation | Highlight, comment, and mark up documents | DocumentCloud (notes), Aleph, Hypothesis
Entity bookmarking | Save and organize entities of interest | Aleph (lists), Datashare (stars/tags)
Document tagging | Categorize documents with custom tags | Datashare, DocumentCloud, Aleph
Permanent redaction | Irrecoverably remove sensitive information | Redactable, DocumentCloud, Adobe Acrobat
AI-powered PII detection | Automatically find personal information for redaction | Redactable, Orson AI
Entity anonymization | Replace real names with placeholders during collaboration | Orson AI
Redaction audit trail | Log who redacted what and when | Redactable (certificates), Orson AI
Access control / permissions | Role-based access to investigation materials | Aleph, Datashare (server mode), DocumentCloud
Version control | Track document changes and annotation history | DocumentCloud, Git-based workflows
Export / publication | Generate publication-ready document packages | DocumentCloud (embed), Aleph, Hunchly

14. Data Cleaning & Record Linkage

Preparing messy real-world data for analysis.

Core Functions

Function | Description | Tools
Data cleaning | Fix inconsistencies, normalize formats, handle missing values | OpenRefine, pandas, Excel/Sheets
Clustering / deduplication | Group and merge similar records | OpenRefine (key collision, nearest neighbor)
Reconciliation | Match local data against external databases (Wikidata, etc.) | OpenRefine reconciliation API
Data transformation | Convert between formats (CSV, JSON, XML, SQL) | OpenRefine, jq, csvkit
Faceting | Explore data distributions by text, numeric, date facets | OpenRefine
Regular expression extraction | Pattern-based data extraction from text | OpenRefine, grep, regex tools
Spreadsheet analysis | Pivot tables, formulas, statistical analysis | Excel, Google Sheets, Tableau
Data visualization | Charts, graphs, dashboards | Tableau, Flourish, Datawrapper, D3.js
Geocoding | Convert addresses to coordinates and vice versa | Google Geocoding API, Nominatim
Date normalization | Parse and standardize dates across formats | dateutil, Duckling, custom parsers
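
Key collision clustering, listed above under OpenRefine, groups values that reduce to the same normalized key. The sketch below is loosely modeled on the fingerprint idea (trim, lowercase, strip accents and punctuation, then sort and deduplicate tokens); it is an approximation written for illustration, not OpenRefine's exact implementation.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a string to a normalized key that ignores case, accents, punctuation, and word order."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values sharing a fingerprint; return only groups with more than one member."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]
```

Each returned group is a candidate merge for a human to review; values that look alike but refer to different entities will also collide, so the clusters should not be merged blindly.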

15. Transportation Tracking

Monitoring movements of aircraft, ships, and vehicles.

Core Functions

Function | Description | Tools
Flight tracking (ADS-B) | Real-time and historical aircraft position data | ADS-B Exchange, Flightradar24, FlightAware
Aircraft registration lookup | Identify owners of specific aircraft | FAA Registry, national aviation databases
Ship tracking (AIS) | Real-time vessel position and voyage data | MarineTraffic, VesselFinder
AIS gap detection | Identify vessels turning off transponders (sanctions evasion) | RadianceFleet
Maritime anomaly detection | Flag suspicious vessel behavior patterns | RadianceFleet, Phantom Tide
Sanctions vessel cross-referencing | Match tracked vessels against sanctions watchlists | RadianceFleet, MarineTraffic
Vehicle plate recognition | Track vehicles via license plate databases | Varies by jurisdiction
Rail / freight tracking | Monitor cargo shipments | Open-source tools, carrier APIs
Cross-domain correlation | Combine air, sea, and ground tracking data | Phantom Tide
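
AIS gap detection, in its simplest form, means finding unusually long silences between consecutive position reports from one vessel. The sketch below assumes a list of report timestamps for a single vessel and a six-hour threshold; both the function name and the threshold are illustrative assumptions, and this is not a description of how RadianceFleet works.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_silence: timedelta = timedelta(hours=6)):
    """Return (last_seen, next_seen) pairs where the silence exceeds max_silence."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_silence
    ]
```

A gap alone is weak evidence, since coverage holes and receiver outages also cause silences; it becomes more meaningful when the vessel's positions before and after the gap imply travel through an area of interest.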

16. Cryptocurrency & Dark Web

Investigating digital financial crime and hidden networks.

Core Functions

Function | Description | Tools
Blockchain transaction tracing | Follow cryptocurrency flows across wallets | Chainalysis Reactor, Crystal, Elliptic
Wallet identification | Link wallets to real-world entities | Chainalysis (134K+ counterparties)
Multi-chain analysis | Trace across Bitcoin, Ethereum, and 27+ blockchains | Chainalysis, Arkham Intelligence
Mixing / tumbling detection | Identify obfuscated transactions | Chainalysis (demixing), custom analysis
DeFi / smart contract analysis | Trace swaps, bridges, and complex DeFi activity | Chainalysis, Etherscan
Dark web monitoring | Scan dark web marketplaces and forums | Cloudburst, DarkOwl, Flashpoint
Dark web + crypto correlation | Link dark web actors to blockchain activity | Chainalysis + Cloudburst integration
Ransomware tracking | Trace ransom payments and money laundering | Chainalysis

17. AI/LLM-Powered Investigation

Emerging AI capabilities transforming investigative journalism.

Core Functions

Function | Description | Tools
RAG-powered document Q&A | Ask natural language questions across document collections | Trove (Claude/Gemini), Arkham Mirror, Presswork.ai
Semantic / embedding search | Find conceptually related content beyond keyword matching | Trove, Arkham Mirror, IntellyWeave (Weaviate)
AI-powered summarization | Generate summaries of long documents or document sets | Trove, Presswork.ai, Google Pinpoint
Contradiction detection | Flag inconsistencies across documents | Arkham Mirror
Multi-agent reasoning | Multiple AI agents collaborating on complex analysis | IntellyWeave (DSPy), multi-agent workflows
Automated timeline extraction | Build event timelines from unstructured text | Arkham Mirror, Trove
FOIA request generation | AI-drafted public records requests | Presswork.ai
Source CRM | Track sources, contacts, and leads across investigations | Presswork.ai
Claim extraction & verification | Extract factual claims from text and assess verifiability | Sonar Deep Research (Perplexity)
Evidence scoring | Rank evidence by relevance, reliability, and provenance | Sonar Deep Research, Trove
Multi-source aggregation & briefing | Synthesize intelligence from diverse real-time feeds | Crucix (27 data sources), GDELT, ACLED
Hypothesis-driven investigation | AI-guided exploration starting from investigative hypotheses | IntellyWeave
Citation / provenance tracking | Every AI answer linked back to source documents | Trove, Sonar Deep Research


Part II — Open Source Intelligence (OSINT) Resources

The sections below catalog the broader OSINT tool ecosystem that investigative journalists draw from beyond their core document analysis platforms. These tools are used for identity research, infrastructure reconnaissance, threat intelligence, and domain-specific monitoring.


18. OSINT Frameworks & Aggregators

Master directories and curated collections that organize the OSINT landscape.

Tool Directories

Resource | Maintainer | Scope | Access
OSINT Framework | osintframework.com | Interactive tree of OSINT tools by category | Free, web-based
Bellingcat Online Investigation Toolkit | Bellingcat | 11 categories; downloadable as CSV | Free
Awesome OSINT | GitHub community (220+ contributors) | Curated GitHub list covering all OSINT domains | Free
OSINT Directory | osintdirectory.com | 540+ tools across 15 categories with tutorials | Free
OSINT Tools Library | OSINT Newsletter | Practitioner-curated, reliability-tested tools | Free
Cipher387 OSINT Stuff Tool Collection | cipher387 | 1,000+ tools with descriptions | Free
OSINTBench | osintbench.com | Categorized tools with ratings and comparisons | Free
Legendary OSINT | K2SOsint (GitHub) | Dark web, malware, phishing, automation focus | Free
OSINT Bible | frangelbarrera (GitHub) | 33 specialized categories with methodology guides | Free
Worldwide OSINT Tools Map | cybdetective.com | 614+ services organized by country on interactive map | Free

Key Directory Categories (Bellingcat taxonomy)

Category | Focus Area
Maps & Satellites | Google Earth, Planet, Maxar, Sentinel Hub
Geolocation | OpenStreetMap, GeoHints, SunCalc
Image/Video | Google Lens, InVID, Forensically
Social Media | Platform-specific analysis tools
People | Sherlock, Maigret, WhatsMyName
Websites | Wayback Machine, IntelX, DomainTools
Companies & Finance | EDGAR, OpenCorporates, sanctions lists
Conflict | ACLED, LiveUAMap, munitions databases
Transport | FlightAware, Flightradar24, MarineTraffic
Environment & Wildlife | Global Forest Watch, Global Fishing Watch, NASA FIRMS
Archiving | Auto Archiver, Archive.today, Hunchly

19. People & Identity Investigation

Tools for researching individuals — from username enumeration to facial recognition.

Username & Account Discovery

Tool | Description | Coverage | License
Sherlock | Search for usernames across social networks | 400+ platforms | Open Source
SherlockOSINT | Web-based username search | 700+ platforms | Free
Maigret | Advanced username dossier collection with profile parsing | 3,000+ sites (500 default) | Open Source
WhatsMyName | Username enumeration with community-maintained site list | 600+ sites | Open Source
Namechk | Username and domain availability checker | 100+ platforms | Free
KnowEm | Username search across social networks and domains | 500+ platforms | Free
Blackbird | Fast username search with OSINT-focused output | 500+ sites | Open Source

People Search Engines

Tool | Description | Coverage | License
Pipl | Deep people search engine (professional/commercial) | Global identity data | Commercial
ThatsThem | Reverse lookups by name, email, phone, IP, address | US-focused | Free tier
Spokeo | Aggregated people search (public records, social, property) | US | Commercial
BeenVerified | Background checks and people search | US | Commercial
Max Intel | 72 free OSINT tools including people search, ghost finder | Global | Free
Social Catfish | Reverse image, phone, email, and name search | US/Global | Commercial

Facial Recognition & Image-Based People Search

Tool | Description | Notes
PimEyes | Facial recognition search engine; finds face matches across the web | Commercial; controversial privacy implications
FaceCheck.ID | Reverse face search engine | Commercial
Search4faces | Facial recognition search across VK and Odnoklassniki | Free; Russian social networks
Azure AI Video Indexer | Facial detection in video (not identification) | Microsoft; free tier available
Google Lens | Visual search that can match faces to web appearances | Free

Identity Verification & Analysis

Tool | Description | Use Case
Epieos | Email and phone OSINT investigation | Identify accounts linked to email/phone
GHunt | Google account investigation from email | Discover Google services, reviews, maps contributions
Creepy | Geolocation information gathering from social media | Map subject movements from geotagged posts
Social Geo Lens | Discover public posts from specific geographic areas | Location-based people discovery

20. Email Intelligence

Investigating email addresses — ownership, breach exposure, organizational mapping.

Core Functions

Function | Description | Tools
Email verification | Check if an email address exists and is deliverable | Hunter.io, Email Hippo, NeverBounce
Organizational email discovery | Find email addresses for a domain/organization | Hunter.io, RocketReach, Snov.io
Email-to-identity mapping | Link an email to social accounts and real identity | Epieos, Holehe, GHunt
Breach exposure check | Check if email appears in known data breaches | Have I Been Pwned, XposedOrNot, DeHashed
Email header analysis | Parse email headers for origin, routing, and spoofing indicators | MXToolbox, Google Admin Toolbox
Reverse email search | Find social profiles and registrations from an email | Epieos, Holehe, UserSearch
Domain email patterns | Discover naming conventions used by an organization | Hunter.io, theHarvester
Email infrastructure analysis | Examine MX records, SPF, DKIM, DMARC configuration | MXToolbox, dmarcian

Key Tools

Tool | Description | License
Hunter.io | Email finder and verifier; maps organizational email structures | Freemium (25 searches/mo free)
Holehe | Check which services an email is registered on (80+ sites) | Open Source
Epieos | Email and phone OSINT: linked accounts, breach data | Free
theHarvester | Gather emails, subdomains, IPs from public sources | Open Source (Kali)
Have I Been Pwned | Email breach exposure database (12B+ compromised accounts) | Free (API paid)
Snov.io | Email finder, verifier, and outreach automation | Freemium
RocketReach | Professional email and phone number finder | Commercial
MXToolbox | Email server diagnostics, blacklist check, header analysis | Free
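
Email header analysis for routing usually starts with the `Received` headers, which each relaying server prepends, so reading them bottom-up gives the path from origin to destination. The sketch below uses Python's standard `email` package; the function name and the returned dictionary layout are assumptions for illustration. Headers below the first trusted server can be forged, so the earliest hops should be treated as claims rather than facts.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def received_hops(raw_message: str) -> list[dict]:
    """Return Received hops in travel order (origin first) with parsed timestamps."""
    msg = message_from_string(raw_message)
    hops = []
    # Servers prepend Received headers, so the last one listed is the earliest hop.
    for header in reversed(msg.get_all("Received", [])):
        flat = " ".join(header.split())  # undo header line folding
        route, sep, date_part = flat.rpartition(";")
        if not sep:
            route, date_part = flat, ""
        try:
            when = parsedate_to_datetime(date_part.strip()) if date_part.strip() else None
        except (TypeError, ValueError):
            when = None
        hops.append({"route": route.strip(), "time": when})
    return hops
```

Large or negative time differences between consecutive hops, or a first hop whose claimed host does not match the sender's domain, are the kind of anomalies worth a closer look.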

21. Phone Number Intelligence

Tools for investigating phone numbers — carrier, owner, location, and linked accounts.

Core Functions

Function | Description | Tools
Carrier lookup | Identify the carrier/operator for a phone number | PhoneInfoga, NumLookup, Twilio Lookup
Reverse phone search | Find the owner or identity behind a number | Truecaller, Sync.me, ThatsThem
Number type identification | Determine if number is mobile, landline, VoIP | PhoneInfoga, Twilio Lookup API
Country/region identification | Parse international number formatting and origin | libphonenumber, PhoneInfoga
Caller ID aggregation | Aggregate crowdsourced caller identification data | Truecaller (7B+ numbers), Sync.me
Linked account discovery | Find accounts registered with a phone number | Epieos, Signal check, WhatsApp check
Spam/scam flagging | Check if a number is reported as spam or fraud | Truecaller, Should I Answer

Key Tools

Tool | Description | License
PhoneInfoga | Advanced phone number scanner: carrier, type, region, linked accounts | Open Source
Truecaller | Global caller ID database (7B+ numbers, 400M+ users) | Freemium
NumLookup | Free reverse phone lookup | Free
Sync.me | Caller ID and reverse phone lookup | Freemium
Twilio Lookup API | Carrier and caller name lookup via API | Pay-per-lookup
libphonenumber | Google’s phone number parsing/formatting library | Open Source

22. Domain, DNS & Infrastructure Intelligence

Investigating web infrastructure — domains, IP addresses, hosting, certificates, and network topology.

Core Functions

Function Description Tools
WHOIS lookup Domain registration details — owner, registrar, dates DomainTools, WHOIS.com, ViewDNS
Historical WHOIS Track domain ownership changes over time DomainTools, SecurityTrails
DNS record lookup A, AAAA, MX, NS, TXT, CNAME records DNSDumpster, DNSRecon, ViewDNS
Subdomain enumeration Discover all subdomains for a domain Subdominator (50+ sources), Amass, Subfinder
Certificate transparency search Find related domains sharing SSL certificates crt.sh, Censys
Reverse IP lookup Find other domains hosted on the same IP ViewDNS, Shodan, SecurityTrails
Technology profiling Identify web frameworks, CMS, analytics, hosting BuiltWith, Wappalyzer, WhatRuns
Internet-connected device search Discover exposed devices, services, vulnerabilities Shodan, Censys, ZoomEye, Fofa
Port scanning / service detection Identify open ports and running services Shodan, Censys, Nmap
IP geolocation Map IP addresses to physical locations MaxMind, ipinfo.io, IPVoid
ASN / BGP analysis Identify network ownership and routing Hurricane Electric BGP, RIPE Stat
Website change detection Monitor web pages for content changes Visualping, ChangeTower, Distill.io
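
Certificate transparency results usually need deduplication before they are useful for subdomain discovery. The following is a minimal sketch, assuming records shaped like crt.sh's JSON output (a list of objects whose `name_value` field holds one or more newline-separated hostnames). The helper name `hostnames_from_ct` and the sample records are hypothetical and only illustrate the idea.

```python
def hostnames_from_ct(records, root_domain):
    """Collect unique hostnames under root_domain from CT-log records.

    Assumes each record is a dict with a `name_value` string that may
    contain several newline-separated names.
    """
    root = root_domain.lower().strip(".")
    found = set()
    for record in records:
        for name in record.get("name_value", "").splitlines():
            name = name.strip().lower()
            # Wildcard entries such as *.dev.example.org point at the parent zone.
            if name.startswith("*."):
                name = name[2:]
            # Keep only names that belong to the domain under investigation.
            if name == root or name.endswith("." + root):
                found.add(name)
    return sorted(found)


# Hypothetical sample data for illustration only.
sample = [
    {"name_value": "www.example.org\nmail.example.org"},
    {"name_value": "*.dev.example.org"},
    {"name_value": "unrelated.example.net"},
]
print(hostnames_from_ct(sample, "example.org"))
```

Filtering on the root domain keeps certificates that merely share an issuer or a multi-domain SAN list from polluting the result.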

Key Tools

Tool Description License
Shodan Internet-connected device search engine (3M+ users, 89% of Fortune 100) Freemium
Censys Internet-wide scan data — hosts, certificates, protocols Freemium
DNSDumpster Free domain research and DNS reconnaissance Free
crt.sh Certificate Transparency log search Free
DomainTools WHOIS, reverse WHOIS, domain history Commercial
SecurityTrails Historical DNS, WHOIS, subdomain intelligence Freemium
BuiltWith Technology profiling for websites Freemium
Wappalyzer Browser extension for technology detection Free
ViewDNS.info DNS lookup, reverse IP, WHOIS history, firewall detection Free
Amass In-depth subdomain enumeration and attack surface mapping Open Source (OWASP)
BBOT Recursive modular OSINT framework with 80+ modules Open Source
Subdominator Passive subdomain enumeration from 50+ sources Open Source
ZoomEye Cyberspace search engine (Chinese alternative to Shodan) Freemium
Fofa Cyberspace search and asset mapping Freemium

23. OSINT Search Engines & Dark Web Search

Specialized search engines that index content beyond the surface web.

Surface Web OSINT Search

Tool Description Special Capability
Intelligence X (IntelX) OSINT search for emails, domains, IPs, Bitcoin, files Historical data, leaked datasets, dark web content
Google Dorking Advanced Google operators for targeted search site:, filetype:, intitle:, inurl: operators
Yandex Russian search engine with strong image search Often returns results Google misses
DuckDuckGo Privacy-focused search with !bang shortcuts Useful for non-personalized results
Carrot2 Search results clustering engine Groups results by topic automatically
The OSINT Vault Multi-Search Launcher (80+ platforms from single query) Batch search across multiple engines
Cylect.io AI-powered OSINT search aggregator Aggregates multiple search engines and tools
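
The operators listed for Google Dorking combine into a single query string. A minimal sketch of composing one is below; the function name `build_dork` and the example domain are hypothetical, and search engines may interpret or ignore individual operators differently.

```python
def build_dork(terms, site=None, filetype=None, intitle=None, inurl=None):
    """Compose a search query string from common advanced operators."""
    parts = []
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    if intitle:
        parts.append(f'intitle:"{intitle}"')
    if inurl:
        parts.append(f"inurl:{inurl}")
    # Quote multi-word terms so they are searched as exact phrases.
    for term in terms:
        parts.append(f'"{term}"' if " " in term else term)
    return " ".join(parts)


query = build_dork(["annual report", "subsidiaries"], site="example.org", filetype="pdf")
print(query)
```

Keeping the operators as named parameters makes it easy to log exactly which query produced a given finding.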

Dark Web & Tor Search

Tool Description Access
Ahmia Tor hidden services search engine with abuse blacklist Web (clearnet) + Tor
Torch Long-running Tor search engine Tor only
OnionLand Tor hidden service search Tor
Haystack Dark web search engine Tor
DarkSearch Dark web search with API access Clearnet interface
Kilos Dark web market search engine Tor
IntelX Indexes some dark web content alongside clearnet Clearnet + archived Tor content

Code & Pastebin Search

Tool Description Useful For
GitHub Code Search Search across all public GitHub repositories Leaked credentials, API keys, internal docs
Grep.app Full-text search across 500K+ public Git repos Fast code/secret search
SearchCode Multi-platform code search (GitHub, BitBucket, GitLab) Cross-platform code discovery
Pastebin Public paste monitoring Leaked data, doxes, breach dumps
PasteLert Alert service for Pastebin mentions Monitor brand/keyword mentions
PublicWWW Source code search across live websites Find sites using specific scripts, pixels

24. Breach Data & Credential Intelligence

Databases and tools for investigating data breaches and leaked credentials.

Core Functions

Function Description Tools
Email breach check Check if an email appears in known breaches Have I Been Pwned, XposedOrNot
Password breach check Check if a specific password has been leaked HIBP Pwned Passwords (k-anonymity)
Domain breach check Find all breached accounts for an organization HIBP Domain Search, SpyCloud
Stealer log search Search through infostealer malware dumps HIBP (stealer logs), Hudson Rock
Credential validation Check if leaked credentials are still active Ethical considerations — use for defense only
Combolist monitoring Track distribution of credential dumps Threat intelligence feeds

Key Tools

Tool Description License
Have I Been Pwned (HIBP) Gold standard breach notification (12B+ compromised accounts, 900+ breaches) Free (API commercial)
XposedOrNot Alternative breach database with free API Free
DeHashed Breach database search with credential data Commercial
LeakCheck Breach data search by email, username, phone, keyword Freemium
SpyCloud Enterprise breach analytics and credential monitoring Commercial
Hudson Rock Infostealer intelligence — compromised computers and credentials Commercial
Snusbase Breach data search engine Commercial
IntelX Archived leaked datasets and pastes Freemium

25. Public Records & Government Databases

Free and paid access to court records, property data, business filings, and other government records.

US Federal Records

Resource Description Access
PACER Federal court electronic records (dockets, filings, opinions) Paid (fees waived if $30 or less per quarter)
RECAP Free archive of PACER documents + browser extension Free (via Free Law Project)
CourtListener Federal and state court opinions search Free
Federal Judicial Center IDB Integrated federal court docket database Free
EDGAR SEC company filings, insider trading, proxy statements Free
FEC.gov Federal campaign finance data — donations, expenditures Free
OpenSecrets Money in politics — lobbying, donations, PACs Free
USAspending.gov Federal spending and contract data Free
SAM.gov Federal contractor registrations and exclusions Free
FOIA.gov Federal FOIA request portal Free

US State & Local Records

Resource Description Access
National Center for State Courts Directory of all state court websites Free
Black Book Online County-level court records by state/county Free
County assessor databases Property ownership, assessed values, tax records Free (varies by county)
Secretary of State databases Business entity filings, UCC filings, notary records Free (varies by state)
State corporation commission Corporate registration and annual reports Free (varies by state)

International Records & Registries

Resource Description Coverage
OpenCorporates World’s largest open database of company data 140+ jurisdictions
ICIJ Offshore Leaks Panama Papers, Paradise Papers, Pandora Papers entities 810K+ entities
OpenSanctions Sanctions, PEP, criminal interest entities 2.1M+ entities, 328 sources
UK Companies House UK company filings, directors, accounts Free, comprehensive
EU Transparency Register EU lobbyist registrations Free
Worldwide OSINT Map Country-by-country directory of public records databases 614+ services, 100+ countries

26. Conflict Monitoring & Weapons Identification

Tools for tracking armed conflicts, political violence, and munitions.

Conflict Event Databases

Tool Description Coverage
ACLED Armed Conflict Location & Event Data Project Global political violence + protests
GDELT Global Database of Events, Language, and Tone 300+ categories of events, real-time
LiveUAMap Interactive conflict mapping with real-time events Ukraine, Middle East, Syria, global
Beholder 80+ OSINT sources, 45+ specialized dashboards Global threat intelligence
Uppsala Conflict Data Program Academic conflict dataset (1946–present) Global armed conflicts
ICG CrisisWatch Monthly global conflict tracking International Crisis Group

Weapons & Munitions Identification

Tool Description Use Case
METIS (Fenix Insight) 6,700+ technical records, 45,000+ munitions images, 500K+ events Identify weapons/munitions in photos and video
iTrace (Conflict Armament Research) Field investigation + weapons tracking database Trace weapon supply chains from point of use
Open Source Munitions Portal Searchable verified munitions image library Visual identification reference
Bulletpicker.com Ammunition guidebooks and armed forces manuals Reference for ammunition identification
Small Arms Survey Research on weapons, violence, and arms transfers Data and analysis for policy
Janes Defense and security intelligence (ships, aircraft, weapons) Commercial; used by governments and media

27. Environmental & Wildlife Monitoring

Satellite and sensor tools for investigating environmental crimes and natural events.

Core Functions

Function Description Tools
Active fire detection Near real-time fire/hotspot detection from satellite NASA FIRMS (MODIS/VIIRS), Sentinel Hub
Deforestation monitoring Track forest loss and illegal logging Global Forest Watch, Sentinel Hub
Illegal fishing detection Vessel tracking + fishing activity analysis Global Fishing Watch
Air quality monitoring Real-time pollution and air quality data IQAir, PurpleAir, OpenAQ
Water quality / oil spill detection Satellite-based oil spill and water monitoring Sentinel-1 SAR, SkyTruth
Climate data analysis Historical and projected climate data NOAA, ERA5, Climate Reanalyzer
Wildlife trade monitoring Track illegal wildlife trade online TRAFFIC, WWF

Key Platforms

Tool Description License
NASA FIRMS Fire Information for Resource Management System — 3-hour latency global fire data Free
Global Forest Watch Satellite-based deforestation monitoring and alerts Free
Global Fishing Watch AIS-based fishing activity tracking and vessel monitoring Free
Sentinel Hub ESA Copernicus satellite imagery browser and API Free tier available
SkyTruth Satellite analysis for environmental investigations Free / nonprofit
AllTrails Trail and outdoor area mapping (useful for geolocation) Free

28. Wireless & Signals Intelligence

Tools for investigating wireless networks, cell towers, and radio frequency data.

Core Functions

Function Description Tools
WiFi network geolocation Map physical location of WiFi access points by SSID/BSSID WiGLE
Cell tower mapping Locate and map cell tower positions OpenCellID, CellMapper
IMSI catcher detection Detect fake base stations / Stingrays AIMSICD, SnoopSnitch, Rayhunter
Bluetooth device tracking Discover and track Bluetooth devices in an area nRF Connect, Bluetana
Radio frequency monitoring Monitor radio traffic (ADS-B, marine VHF, ham) RTL-SDR, SDR++
Wardriving Systematic scanning and mapping of wireless networks WiGLE apps, Kismet, Pwnagotchi

Key Tools

Tool Description License
WiGLE Global WiFi/cell network map and lookup (billions of networks) Free for non-commercial
OpenCellID Open cell tower location database Free (community-contributed)
CellMapper Cell tower mapping with carrier identification Free
Kismet Wireless network detector, sniffer, and IDS Open Source
RTL-SDR Software-defined radio for monitoring ADS-B, marine, ham frequencies Open Source (hardware ~$25)

29. Cyber Threat Intelligence

Tools for investigating threats, malware, indicators of compromise, and adversary infrastructure.

Core Functions

Function Description Tools
File/URL reputation check Scan files and URLs against 70+ antivirus engines VirusTotal
IoC enrichment Enrich IP, domain, hash indicators with threat context VirusTotal, AlienVault OTX, IOCLens
Threat actor tracking Monitor known APT groups and campaigns MITRE ATT&CK, APT Watch
Malware sandbox analysis Detonate files in isolated environments Any.Run, Joe Sandbox, Hybrid Analysis
Threat feed aggregation Consolidate threat intelligence from multiple sources MISP, OpenCTI, APT Watch
IP/domain reputation Score IP addresses and domains for malicious activity AbuseIPDB, GreyNoise, Shodan
Phishing detection Identify phishing domains and campaigns PhishTank, URLScan.io, CheckPhish

Key Tools

Tool Description License
VirusTotal File/URL/domain/IP analysis against 70+ AV engines; largest threat reputation database Freemium (API commercial)
AlienVault OTX Open Threat Exchange — community-contributed threat indicators Free
MISP Open-source threat intelligence sharing platform with 100+ expansion modules Open Source
OpenCTI Cyber threat intelligence platform with STIX2 data model Open Source
AbuseIPDB IP address abuse reporting and reputation database Freemium
GreyNoise Internet-wide scanner data — distinguish threats from noise Freemium
URLScan.io URL inspection and phishing detection Free
Any.Run Interactive malware sandbox Freemium
Hybrid Analysis Free malware analysis sandbox (by CrowdStrike) Free
APT Watch Aggregated IoCs from OSINT — 5,700+ IPs, 1.5M+ domains Open Source
IOCLens Browser extension for instant IoC enrichment from 9+ sources Free
MITRE ATT&CK Adversary tactics, techniques, and procedures knowledge base Free

30. OSINT Automation Frameworks

Platforms that automate multi-source intelligence collection into unified workflows.

Core Functions

Function Description Tools
Automated reconnaissance Scan a target (domain, email, IP) across hundreds of sources SpiderFoot, BBOT, Recon-ng
Visual link analysis Map relationships between entities from multiple data sources Maltego
Module-based architecture Extensible plugins for different data sources and analysis All major frameworks
Result correlation Cross-reference findings across multiple sources automatically SpiderFoot, Maltego
Report generation Generate structured reports from investigation findings Maltego, Maigret, SpiderFoot
API orchestration Coordinate queries across multiple OSINT APIs Recon-ng, BBOT
Scheduled monitoring Continuous monitoring for changes or new data SpiderFoot (HX), custom tools

Key Frameworks

Framework Description Modules License
Maltego Visual link analysis + entity transforms 200+ transforms; 1B+ identities; 200M+ company records Commercial (free CE tier)
SpiderFoot Automated OSINT collection and correlation 200+ modules; active/passive modes; web + CLI Open Source (HX commercial)
Recon-ng Metasploit-style reconnaissance framework Modular plugins; API integration; SQLite workspaces Open Source
BBOT Recursive modular OSINT framework 80+ modules; subdomain enum, port scan, web scraping Open Source
SpiderFoot HX Hosted version of SpiderFoot with dashboards and scheduling 300+ modules; adversarial AI detection Commercial
sn0int Semi-automatic OSINT framework with Rust-based engine Package manager for modules; sandbox execution Open Source
Photon Fast web crawler for OSINT data extraction URLs, emails, files, social media Open Source

2026 Trends in OSINT Automation

  • AI-powered transforms — Maltego 2026 uses NLP to infer hidden connections between entities
  • Adversarial AI detection — SpiderFoot flags AI-generated content in scraped data
  • STIX/TAXII integration — Frameworks increasingly support standardized threat intelligence sharing
  • GraphQL APIs — Modern frameworks expose investigation data via GraphQL for custom frontends
  • Multi-agent orchestration — Emerging pattern of coordinating multiple specialized agents (IntellyWeave, Crucix)

31. OSINT Training, Communities & Meta-Resources

Where investigators learn techniques and stay current on tools.

Training & Education

Resource Description Access
Bellingcat Open-source investigation methodology, case studies, how-to guides Free
SANS SEC497 OSINT in the enterprise — formal training course Commercial
OSINT Curious Free training, webcasts, and community events Free
Trace Labs CTF-style events for missing persons investigations (ethical OSINT) Free
Sector035 OSINT Newsletter Weekly OSINT tool and technique roundup Free
OSINT Newsletter Tools, techniques, and investigations weekly digest Free
The OSINT Vault Investigation workflows, bookmarklets, multi-search launchers Free
NixIntel Geolocation challenges and OSINT technique deep-dives Free
Toddington International Comprehensive OSINT training and resources Commercial

Key Communities

Community Platform Focus
r/OSINT Reddit General OSINT discussion and tools
OSINT Curious Discord Discord Community chat, challenges, events
Trace Labs Multiple Missing persons CTF competitions
Bellingcat Discord Discord Investigation collaboration
OSINT Team Medium/Blog Tools reviews and tutorials
IntelTechniques Web Michael Bazzell’s OSINT methodology (privacy-focused)

Bookmarklet & Browser Tools

Tool Description Use Case
OSINT Vault Multi-Search Launch 80+ platform searches from single query Rapid username/domain/email investigation
Hunchly Auto-capture every page visited with timestamps Evidence preservation during browsing
InVID Verification Plugin Right-click image/video verification Quick media authentication
Wappalyzer Browser extension for website technology detection Identify CMS, frameworks, analytics
Shodan Browser Extension See Shodan data for any website you visit Quick infrastructure check

Part III — Analysis & Upgrade Planning


32. Reference Platforms (Full-Stack Investigative)

Major platforms that integrate multiple categories above into unified workflows.

Tier 1: Purpose-Built for Investigative Journalism

Platform Developer License Key Differentiator
OCCRP Aleph OCCRP Open Source (MIT) Cross-referencing against hundreds of datasets; FollowTheMoney data model; network/timeline visualization
ICIJ Datashare ICIJ Open Source Local-first privacy; multi-NER pipeline; batch search API; team collaboration
Google Pinpoint Google News Initiative Free (verified journalists) Audio transcription; handwriting OCR; table-to-spreadsheet; 200K docs/collection
DocumentCloud MuckRock Foundation Open Source 6.9M+ public documents; annotation; embedding API; AI OCR add-ons
Open Semantic Search opensemanticsearch.org Open Source (GPL) Full-stack ETL + search + knowledge graph; thesaurus/ontology support; faceted exploration

Tier 2: AI-Native Investigation Platforms (2024-2026)

Platform License Key Differentiator
Trove Commercial RAG + Claude/Gemini; financial extraction; network visualization; natural language Q&A
Arkham Mirror Open Source Air-gapped/local-first; offline RAG; contradiction detection; knowledge graph
IntellyWeave Open Source GLiNER entities; Mapbox geo; multi-agent reasoning; hypothesis-driven
Presswork.ai Commercial FOIA generation; source CRM; deep research engine; fact-check assist
Crucix Open Source 27 real-time data sources; cross-source correlation; Telegram/Discord alerts

Tier 3: General OSINT Platforms Used by Journalists

Platform Type Key Capability
Maltego Commercial (free CE tier) 200+ data transforms; 1B+ identities; link analysis
i2 Analyst’s Notebook Commercial 30-year industry standard; visual intelligence analysis
SpiderFoot Open Source 200+ recon modules; automated scans
Hunchly Commercial Automatic evidence capture; court-ready packaging
OpenSanctions Open Data (free for journalism) 2.1M+ entities; sanctions + PEP + corporate data

33. Feature Gap Matrix: OSS vs. Ecosystem

How the current capabilities of Open Semantic Search (abbreviated OSS in this section, not to be confused with open-source software in general) compare to the investigative journalism ecosystem, organized for upgrade planning.

Legend

  • Strong = OSS has robust implementation
  • Basic = OSS has partial/dated implementation
  • Gap = Feature absent from OSS; present in peer platforms
  • Emerging = Category emerging in 2024-2026; no peer has mature implementation

# Capability Area OSS Current State Peer Benchmark Priority Signal
1 Multi-format document ingestion Strong (Tika + ETL) Aleph, Datashare Maintain
2 OCR (printed text) Strong (Tesseract) Google Vision, Textract Maintain; consider cloud OCR option
3 Handwriting / audio transcription Gap Pinpoint, Whisper High — differentiator for leak analysis
4 Full-text search Strong (Solr) Elasticsearch (Datashare, Aleph) Maintain
5 Faceted / exploratory search Strong Aleph, Datashare Maintain; modernize UI
6 Semantic / embedding search Gap Trove, Arkham Mirror, IntellyWeave Critical — table-stakes for 2026
7 Natural language Q&A (RAG) Gap Trove, Pinpoint, Presswork Critical — highest user demand
8 Named Entity Recognition Basic (spaCy) GLiNER, CoreNLP, multilingual models Upgrade — add more entity types, improve accuracy
9 Entity linking / disambiguation Basic (Entity Search API) Aleph FtM, Wikidata linking Upgrade
10 Knowledge graph (Neo4j) Basic (integration exists) Aleph FtM, Arkham Mirror Upgrade — richer graph model, better UI
11 Network visualization Basic (Cytoscape.js) Maltego, i2, Aleph diagrams Upgrade — interactive, collaborative
12 Timeline visualization Gap Aleph, Trove, TimelineJS High — standard investigative feature
13 Thesaurus / ontology management Strong (SKOS/RDF) Unique to OSS Maintain — competitive advantage
14 Corporate / sanctions data integration Gap Aleph + OpenSanctions High — integrate via API
15 FOIA / public records integration Gap MuckRock, DocumentCloud Medium — API integration possible
16 Collaboration / team workspaces Gap Aleph investigations, Datashare server Critical — required for team investigations
17 Annotation / tagging Basic (manual tagging) DocumentCloud, Aleph, Hypothesis Upgrade
18 Redaction with audit trail Gap Redactable, DocumentCloud High — essential for source protection
19 Secure document intake Gap SecureDrop, GlobaLeaks Medium — complementary integration
20 Web archiving / evidence preservation Gap Hunchly, Auto Archiver Medium — integration with archive APIs
21 AI-powered summarization Gap Trove, Presswork Critical — paired with RAG
22 Contradiction detection Gap Arkham Mirror Medium-High — unique differentiator
23 Automated timeline extraction Gap Arkham Mirror, Trove High
24 Cross-dataset entity matching Gap Aleph cross-referencing High
25 Data cleaning / record linkage Gap OpenRefine reconciliation Medium — complementary tool
26 Transportation tracking integration Gap ADS-B, MarineTraffic APIs Low — niche; API integration
27 Crypto / blockchain analysis Gap Chainalysis Low — specialized commercial space
28 Geolocation tools integration Gap Bellingcat toolkit Low-Medium — API integration
29 Image/video verification Gap InVID, Forensically Low — specialized tools exist
30 Modern responsive UI Gap (PHP + Django) Aleph (React), Trove, Datashare Critical — modernize frontend
31 REST API / developer ecosystem Basic Datashare API, Aleph API, DocumentCloud API High — enable integrations
32 Multi-language support Basic (spaCy models) New/s/leak (40 languages), Datashare Upgrade
33 Hypothesis-driven investigation Gap IntellyWeave Emerging
34 Multi-agent AI orchestration Gap IntellyWeave, Crucix Emerging
35 Real-time data source aggregation Gap Crucix (GDELT, ACLED, etc.) Emerging
OSINT Resource Integration (§18-31)
36 Username/identity enumeration Gap Sherlock, Maigret, WhatsMyName Low — complementary CLI tools
37 Email intelligence pipeline Gap Hunter.io, Holehe, HIBP API Medium — entity enrichment via API
38 Domain/infrastructure reconnaissance Gap Shodan, Censys, DNSDumpster Medium — infrastructure context for entities
39 Breach data integration Gap HIBP API, IntelX Medium — entity risk scoring from breach exposure
40 Public records connectors Gap PACER/RECAP, CourtListener, EDGAR Medium-High — structured data import
41 Conflict event data feeds Gap ACLED, GDELT APIs Medium — real-time event ingestion
42 Threat intelligence enrichment Gap VirusTotal API, AbuseIPDB Low — niche cybersecurity use case
43 OSINT automation orchestration Gap SpiderFoot, Recon-ng, Maltego Low — complementary tools; API integration

Summary: Top Upgrade Priorities for Open Semantic Search

Based on the gap analysis above, the highest-impact upgrade areas are:

Must-Have (Critical Gaps)

  1. Semantic / Embedding Search — Move beyond keyword matching to meaning-based retrieval using vector embeddings
  2. RAG-Powered Natural Language Q&A — Let journalists ask questions in plain English and get cited answers from their document collections
  3. AI-Powered Summarization — Automatically summarize documents and document sets
  4. Team Collaboration Workspaces — Private investigation spaces with shared access, annotation, and tagging
  5. Modern Responsive UI — Replace aging PHP/Django frontend with a modern React/Vue application
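
To make the first item concrete: meaning-based retrieval ranks documents by the similarity between a query vector and each document vector rather than by shared keywords. The sketch below shows only the ranking step with cosine similarity. The three-dimensional vectors and file names are toy values; a real deployment would obtain vectors from an embedding model and use a vector index instead of a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, doc_vecs, top_k=3):
    """Return the ids of the top_k documents most similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:top_k]]


# Toy vectors for illustration; not produced by any real model.
docs = {
    "invoice.pdf": [0.9, 0.1, 0.0],
    "memo.docx": [0.1, 0.9, 0.1],
    "contract.pdf": [0.8, 0.2, 0.1],
}
print(rank([1.0, 0.0, 0.0], docs, top_k=2))
```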

High Priority

  1. Enhanced NER — More entity types (financial, legal), better accuracy, multilingual expansion
  2. Timeline Visualization — Chronological event plotting extracted from documents
  3. Richer Knowledge Graph — Upgrade Neo4j integration with better graph models and interactive visualization
  4. Corporate/Sanctions Data APIs — Integrate OpenSanctions, OpenCorporates, ICIJ Offshore Leaks
  5. Redaction with Audit Trail — Permanent redaction with PII detection and compliance logging
  6. Cross-Dataset Entity Matching — Match entities across collections like Aleph does
  7. REST API Modernization — Full CRUD API enabling third-party integrations and automation
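
Cross-dataset entity matching typically starts with name normalization followed by a similarity score. The following is a minimal sketch of that idea using the standard library's `difflib`. It is not Aleph's cross-referencing method; the suffix list, the 0.85 threshold, and the sample names are illustrative assumptions, and production matching would also compare attributes such as jurisdiction or dates.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Illustrative, incomplete list of legal-form suffixes to ignore.
LEGAL_SUFFIXES = {"ltd", "limited", "inc", "llc", "gmbh", "sa", "corp", "co"}


def normalize(name):
    """Strip accents, lowercase, drop punctuation and legal suffixes."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    tokens = re.findall(r"[a-z0-9]+", ascii_name.lower())
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def match_entities(left, right, threshold=0.85):
    """Return (left_name, right_name, score) pairs at or above threshold."""
    matches = []
    for a in left:
        for b in right:
            score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if score >= threshold:
                matches.append((a, b, round(score, 2)))
    return matches


# Hypothetical names from two collections.
collection_a = ["Acme Holdings Ltd.", "Société Générale S.A."]
collection_b = ["ACME HOLDINGS LIMITED", "Globex Corp"]
print(match_entities(collection_a, collection_b))
```

The pairwise loop is quadratic, so larger collections would need a blocking step (for example, grouping by first token) before scoring.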

Medium Priority

  1. Audio/video transcription (Whisper integration)
  2. Contradiction detection across documents
  3. Automated timeline extraction from unstructured text
  4. Web archive integration (Wayback Machine, Archive.today APIs)
  5. FOIA/public records integration (MuckRock/DocumentCloud APIs)
  6. Data cleaning/record linkage (OpenRefine-style reconciliation)
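
For the web archive integration item, one low-effort starting point is the Wayback Machine availability lookup, which reports the closest archived snapshot for a URL. The sketch below builds the request URL and parses a response, assuming the documented `archived_snapshots.closest` JSON shape. The helper names and the sample response are hypothetical, and no network call is made here.

```python
import json
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the lookup URL; timestamp is an optional YYYYMMDD-style string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return (snapshot_url, timestamp) or None if nothing is archived."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"], closest["timestamp"]
    return None


# Made-up response body for illustration only.
sample_response = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20240101000000/http://example.org/",
            "timestamp": "20240101000000",
            "status": "200",
        }
    }
})
print(availability_url("example.org", "20240101"))
print(closest_snapshot(sample_response))
```

Storing the returned snapshot URL next to each ingested web document would give every citation a stable archived reference.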

Future / Watch

  1. Hypothesis-driven investigation workflows
  2. Multi-agent AI orchestration
  3. Real-time data source aggregation (GDELT, ACLED feeds)
  4. Blockchain analysis integration
  5. Transportation tracking integration

OSINT Integration Opportunities (from Part II)

  1. Public records connectors (PACER/RECAP, CourtListener, EDGAR) — structured data import
  2. Breach data entity enrichment (HIBP API) — flag entities with known breach exposure
  3. Email intelligence pipeline (Hunter.io, Holehe) — enrich person entities with email accounts
  4. Domain infrastructure context (Shodan, Censys) — enrich organization entities with hosting data
  5. Conflict event data feeds (ACLED, GDELT) — real-time event ingestion for timeline features

This directory was compiled from analysis of: OCCRP Aleph, ICIJ Datashare, Google Pinpoint, DocumentCloud, MuckRock, Bellingcat Online Investigation Toolkit, Maltego, i2 Analyst’s Notebook, OpenSanctions, OpenCorporates, Trove, Arkham Mirror, IntellyWeave, Presswork.ai, Crucix, Hunchly, SecureDrop, GlobaLeaks, OpenRefine, Chainalysis, RadianceFleet, New/s/leak 2.0, Orson AI, Redactable, OSINT Framework, Awesome OSINT, OSINT Tools Library, Sherlock, Maigret, PhoneInfoga, Shodan, Censys, Have I Been Pwned, Intelligence X, WiGLE, NASA FIRMS, ACLED, GDELT, SpiderFoot, Recon-ng, BBOT, VirusTotal, MISP, and the broader OSINT tool ecosystem as of April 2026.
