Mergeflow data sets

General background on Mergeflow data sets

Currently, Mergeflow collects ca. 15,000 new documents every day. Collection runs fully automatically, 24/7, and covers news, patents, science publications, blog posts, press releases, clinical trials, and other data sets and sources (see below for details).

Mergeflow collects data from various types of sources, most of them via the web. Some of these sources provide data collection interfaces (APIs) for their contents, but most do not, so we build customized crawlers for them.

All the data collected by Mergeflow are raw, unstructured text. We do not rely on structured third-party databases, e.g. for market or investment data. Instead, this information is extracted from the text via natural language processing, semantic modeling, machine learning, and other methods.

We have written a PDF slidedoc that provides more background on our approach to data collection, analytics, and visualization. 

Scientific Publications

Research papers, conference proceedings, and preprints from across disciplines. This includes databases such as arXiv and PubMed.

  • Ca. 30,000 new documents / week.
  • Updated every 2 hours.

Patents

Patent publications from patent offices worldwide, collected and provided by the European Patent Office. Patents are bundled by family, so Mergeflow counts patent families rather than individual patent publications. The number of individual publications is much higher, since a family can bundle multiple publications. Mergeflow first collects each patent in its original form and language, and then replaces non-English texts with the English version as soon as it becomes available via the European Patent Office.

  • Ca. 12,500 new documents / week.
  • Updated weekly.
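The difference between counting families and counting publications can be illustrated with a minimal Python sketch. It is not Mergeflow's implementation: the record layout, the field names (`family_id`, `original_text`, `english_text`), and the sample publication numbers are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical sample records; all identifiers and field names are invented.
publications = [
    {"family_id": "F1", "number": "PUB-001", "language": "de",
     "original_text": "Originaltext", "english_text": "English version"},
    {"family_id": "F1", "number": "PUB-002", "language": "en",
     "original_text": "English original", "english_text": None},
    {"family_id": "F2", "number": "PUB-003", "language": "ja",
     "original_text": "原文", "english_text": None},  # no English version yet
]

def bundle_by_family(pubs):
    """Group individual publications under their family identifier."""
    families = defaultdict(list)
    for pub in pubs:
        families[pub["family_id"]].append(pub)
    return dict(families)

def display_text(pub):
    """Use the English version once it is available, else the original text."""
    if pub["language"] != "en" and pub["english_text"] is not None:
        return pub["english_text"]
    return pub["original_text"]

families = bundle_by_family(publications)
family_count = len(families)            # what a family-based count reports
publication_count = len(publications)   # individual publications
texts = [display_text(pub) for pub in publications]
```

In this sample, three publications collapse into two families, and only the German publication is shown in English, because the Japanese one has no English version yet.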

Industry News

Worldwide business news from various industries. This includes dedicated industry news outlets and PR newswires, as well as the economics and business sections of many mainstream media outlets worldwide.

  • Ca. 15,000 new documents / week.
  • Updated every 2 hours.

Financial and Investor News

Market estimates and other technology-relevant news. Sources here include dedicated market and finance news portals, as well as the finance sections of many daily news outlets worldwide.

  • Ca. 12,000 new documents / week.
  • Updated every 2 hours.

Technology Blogs and News

Thoughts, ideas, and forecasts from the most respected tech journalists around the world. This includes individual blogs as well as technology news portals.

  • Ca. 10,000 new documents / week.
  • Updated every 2 hours.

Venture Capital

Updates on venture capital funding events from around the world. This includes dedicated information portals as well as blogs and similar outlets run by investors such as venture capitalists.

  • Ca. 1,500 new documents / week.
  • Updated every 2 hours.

Technology Transfer

Technologies available for licensing from universities and R&D organizations worldwide. This includes, for example, Columbia University, Cornell University, Emory, Hebrew University, MIT, Purdue, Rice, Stanford, University of California, US National Laboratories; Universities of Arizona, Cambridge, Chicago, Delaware, Michigan, Pennsylvania, Tel Aviv, Texas; ESA, NASA, Weizmann Institute, and others.

  • Ca. 250 new documents / week.
  • Updated every 2 hours.

Funded Research Projects

Updates on US, UK, and EU publicly funded research projects (e.g. SBIR, NIH, NSF, Innovate UK, EU CORDIS).

  • Ca. 400 new documents / week.
  • Updated every 2 hours.

Clinical Trials

Updates on clinical trials from around the world and across all phases, from sources such as NIH and the EU Clinical Trials Register.

  • Ca. 600 new documents / week.
  • Updated every 2 hours.

Coverage

All data sets have worldwide coverage, except Funded Research Projects (US, UK, EU).

Most data sets have ca. 7 years of history. Patents and Clinical Trials have ca. 25 years of history.
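The approximate weekly volumes listed for each data set above can be tallied with a short Python sketch. The numbers are copied from this page. The dictionary keys and variable names are illustrative only. The result covers only the itemized data sets, so it need not match the overall daily figure, which also includes other data sets and sources.

```python
# Approximate weekly volumes as listed per data set on this page.
weekly_volumes = {
    "Scientific Publications": 30_000,
    "Patents": 12_500,
    "Industry News": 15_000,
    "Financial and Investor News": 12_000,
    "Technology Blogs and News": 10_000,
    "Venture Capital": 1_500,
    "Technology Transfer": 250,
    "Funded Research Projects": 400,
    "Clinical Trials": 600,
}

total_per_week = sum(weekly_volumes.values())
average_per_day = total_per_week / 7

print(f"Listed data sets: ca. {total_per_week:,} documents / week")
print(f"Listed data sets: ca. {average_per_day:,.0f} documents / day")
```

For the listed data sets this gives ca. 82,250 documents per week, or ca. 11,750 per day.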
