The Challenge of Bioprocess Data Management: The 6S Problem

Biology has turned a corner. Once a field devoted solely to understanding life on Earth, biology is now increasingly employed to make critical products across a wide range of industrial sectors. Decades of research in genetics, molecular biology, and biochemistry have paved the way for an epoch of biomanufacturing, in which industrial enterprises apply a constellation of biological systems, concerts of enzymes, and genetic tools to produce bio-based goods.

Biomanufacturing’s roots can be traced back thousands of years, originating with humanity’s storied love affair with fermentation (brewing, baking, and cheese making). However, modern biomanufacturing using recombinant biotechnologies and genetic engineering took off in the 1980s and has been growing ever since. More recently, biomanufacturing’s use has skyrocketed in part thanks to sustained emphasis on new biopharmaceutical development, the advent of alternative protein and cultivated meat, and the rising demand for biobased materials and ingredients across sectors like agriculture, food and beverage, personal care, and beyond. Importantly, this precipitous growth is showing no signs of slowing down. 

Stemming from this demand, global capacity constraints for biomanufacturing (especially precision fermentation) have led to urgent site development at biotech companies and contract development and manufacturing organizations (CDMOs/CMOs). They have also stimulated government investment in biomanufacturing readiness in countries including the US, UK, Japan, China, and Australia.

To ensure commercialization success, bioprocess development scientists and engineers must collect, analyze, and store a wide variety of bioprocess data, including critical process parameters (CPPs), critical quality attributes (CQAs), and other key variables. Unfortunately, bioprocess data management solutions have not kept up with the increasing volume and complexity of data generated by today’s bioprocesses. So, despite its importance, managing bioprocess data remains a core challenge for the industry, which helps to explain why many companies struggle with commercialization despite brilliant products and significant demand.

To support a widespread data revolution in the biomanufacturing industry, this blog illustrates the challenge of bioprocess data management, highlighting six key data barriers the industry faces today.

Biology’s Inherent Manufacturing Complexity 

Despite massive gains, it’s clear that biomanufacturing is far from being a mature enterprise. For other manufacturing industries like the chemical process and semiconductor sectors, improved data access and use enabled step changes in productivity and profitability. Recognizing the importance of utilizing manufacturing data, mature industries, like Oil and Gas, continue to digitize their manufacturing processes further. Likewise, an essential maturation step for biomanufacturing revolves around better analyzing and understanding bioprocess data to optimize scalability, generate higher yields, and keep the cost of goods (COGS) low.

Though bioprocess data management is vital for biomanufacturing’s continued development, the inherent complexity of biological systems makes managing this data a much larger challenge. Not only does biomanufacturing produce a large volume of diverse datasets, but that data is also often more difficult to meaningfully decipher, given the complicated interplay of molecular genetics, cellular physiology, enzymatic activities, and fluid dynamics taking place within biological systems at any given time during the bioprocess.

The 6S Problem of Bioprocess Data Management: No Single “Source of Truth” and Too Much Data

Because biological systems have so many intricate details that affect their application in biomanufacturing, bioprocess teams track a lot of variables over time to ensure efficient production runs and effective decision-making. Though bioprocess teams need to collect and use a lot of critical data, they generally must do so without a single source of truth to integrate, centralize, and organize that information. As a result, bioprocess scientists and engineers often just cope with sizable inefficiencies due to a dearth of solutions and an abundance of work they must nevertheless accomplish to get a bioproduct to market.

Playing off the 5S principles of lean manufacturing, we like to say that the bioprocess data management challenge is a “6S problem” made up of data sources, size, silos, software, security, and standards.


To address the challenge of bioprocess data management, we must shine a light on the tremendous diversity of biomanufacturing data. First, it is helpful to note that bioprocess data comes from a variety of sources that can be bucketed into on-line, off-line, metadata, and calculated metrics categories.

On-line measurements collect data from biosensors or soft sensors built into the bioreactor apparatus. This type of data is automatically measured and can be used for constant, real-time bioprocess monitoring. Typically, on-line sensors collect data such as pH, temperature, pressure, dissolved O2/CO2 concentration, conductivity, agitation/stirrer speed, viscosity, and exhaust O2/CO2 (Off-Gas). Increasingly, more sophisticated measurements, like those from built-in near-infrared (NIR), dielectric (DS), Fourier-transform infrared (FTIR), fluorescence (FS), and Raman spectroscopy, are also being built into advanced bioreactor designs. 

On the flip side, off-line measurements collect data from samples removed from the bioreactor. Generally, sample-level methods depend on more involved and time-consuming workflows. This may include measurements taken using a Cedex Bio Analyzer, YSI, gas chromatography (GC), mass spectrometry (MS), NMR, flow-injection analysis, flow cytometry (FC), HPLC, ELISA, electrophoresis, and microscopy.

Metadata is data and information that provides additional necessary context and backdrop to an experiment or production run. Put another way, it is “data about the data.” For example, metadata might include information on cell systems, user information, lot numbers, and much more. Metadata is crucially important because it helps articulate experimental aims, support reliable data science and modeling, and provide deeper traceability into data output. Thus, it’s unsurprising that maintaining metadata is crucial for regulatory compliance and audit trails.

Calculated metrics require analysis and processing of raw and/or calibrated data points to generate descriptive bioprocess indicators. For example, this includes essential information like growth rates, respiratory quotients (RQ), specific yield, and specific productivity. These metrics are some of the most vital indicators of bioprocess efficacy, yet users often manually calculate them using spreadsheets, drawing data from many sources to calculate the final figure.
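
As a rough illustration of the kind of calculation that often lives in spreadsheets, the sketch below computes two of the metrics named above from their standard textbook definitions: the specific growth rate between two biomass readings, μ = ln(X2/X1) / (t2 − t1), and the respiratory quotient, RQ = CER / OUR. The function names, units, and sample values are assumptions made for illustration; they are not taken from any particular tool.

```python
import math


def specific_growth_rate(x1: float, x2: float, t1_h: float, t2_h: float) -> float:
    """Specific growth rate (1/h) between two biomass readings.

    Assumes exponential growth between the two time points.
    """
    if x1 <= 0 or x2 <= 0:
        raise ValueError("biomass readings must be positive")
    if t2_h <= t1_h:
        raise ValueError("t2_h must be later than t1_h")
    return math.log(x2 / x1) / (t2_h - t1_h)


def respiratory_quotient(cer: float, our: float) -> float:
    """RQ = CO2 evolution rate / O2 uptake rate (both in the same units)."""
    if our <= 0:
        raise ValueError("oxygen uptake rate must be positive")
    return cer / our


# Hypothetical values: biomass doubles over two hours.
mu = specific_growth_rate(x1=1.0, x2=2.0, t1_h=0.0, t2_h=2.0)
rq = respiratory_quotient(cer=10.5, our=10.0)
print(f"mu = {mu:.3f} 1/h, RQ = {rq:.2f}")  # mu = 0.347 1/h, RQ = 1.05
```

Even in this small case, the inputs come from different places: biomass is typically an off-line measurement, while CER and OUR are derived from on-line off-gas data.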

| Data Type | Measurements |
| --- | --- |
| On-line Data | pH, temperature, pressure, dissolved O2/CO2 concentration, conductivity, agitation/stirrer speed, viscosity, exhaust O2 and CO2 (Off-Gas), Near-Infrared (NIR), Dielectric (DS), Fourier-Transform Infrared (FTIR), Fluorescence (FS), Raman Spectroscopy |
| Off-line Data | Cedex Bio Analyzer, YSI, Gas Chromatography (GC), Mass Spectrometry (MS), NMR, Flow Cytometry (FC), HPLC, ELISA, Electrophoresis, Microscopy |
| Metadata | Process conditions (controller setpoints, media formulation, inoculum density), strain data (organism, strain ID, plasmid, genomic sequence, cell bank), reactor configuration (geometries, gases, feeds), site information, run quality, user information, lot numbers |
| Calculated Metrics | Growth Rates, Respiratory Quotients (RQ), Specific Yield, Specific Productivity |
Table 1: A non-exhaustive list of typical measurements taken during bioprocess runs.
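
To make the four categories concrete, here is a minimal sketch of how a single run's data might be grouped in one record. The `BioprocessRun` class, its field names, and the sample values are all hypothetical; this is one possible shape under simple assumptions, not a description of any existing schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A time series is a list of (elapsed_hours, value) pairs.
Series = List[Tuple[float, float]]


@dataclass
class BioprocessRun:
    """Hypothetical record grouping the four data categories for one run."""
    run_id: str
    metadata: Dict[str, str] = field(default_factory=dict)      # context for the run
    online: Dict[str, Series] = field(default_factory=dict)     # in-reactor sensor series
    offline: Dict[str, Series] = field(default_factory=dict)    # sample-based assay series
    calculated: Dict[str, float] = field(default_factory=dict)  # derived indicators

    def point_count(self) -> int:
        """Count every stored value across all four categories."""
        series_points = sum(len(s) for s in self.online.values())
        series_points += sum(len(s) for s in self.offline.values())
        return series_points + len(self.metadata) + len(self.calculated)


run = BioprocessRun(
    run_id="run-001",
    metadata={"strain_id": "S-42", "lot_number": "L-0007"},
    online={
        "pH": [(0.0, 7.0), (0.5, 6.98)],
        "temperature_C": [(0.0, 37.0), (0.5, 37.1)],
    },
    offline={"glucose_g_per_L": [(0.5, 4.8)]},
    calculated={"growth_rate_per_h": 0.35},
)
print(run.point_count())  # 8
```

Keeping the categories side by side under one run identifier is what lets a later query relate, say, an off-line assay result to the sensor readings and strain that produced it.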

It is also important to note that the more sophisticated the analytical tools (MS, HPLC, genomic data, NMR, FC), the more complicated it becomes to analyze and meaningfully associate the information with production runs. In short, the source barrier multiplies significantly as analysis needs become more complex. 

As a final point, it’s worth mentioning that downstream processing generates its own data that also needs to be stored and managed. While downstream processing doesn’t usually create as much data, it’s still critical information, especially as it relates to upstream results, critical quality attributes, and compliance.


To understand what makes managing bioprocess data challenging, you also have to conceptualize the sheer volume of data produced by bioprocess runs, including raw measurements, calibrated measurements, set points, calculated metrics, and metadata. Collectively, this results in a lot of data with every run. 

To help put this into perspective, we pulled representative data from Invert’s customers to make conservative estimates about the total number of data points they produce annually. Our analysis shows that large companies produce at least 121 million data points every year (Table 2). Even start-ups can expect tens of millions of data points per year, reaching over 40.4 million in our cursory conservative estimate.

| Company Size | On-line Data Points (per Year) | Off-line Data Points (per Year) | Unique Metadata Points (per Year) | Total (per Year) |
| --- | --- | --- | --- | --- |
| Start-up | 40,320,000 | 32,000 | 48,000 | 40,432,000 |
| Midsize | 80,640,000 | 64,000 | 96,000 | 80,864,000 |
| Large, Established | 120,960,000 | 96,000 | 144,000 | 121,296,000 |
Table 2: Average Number of Data Points Generated From Bioprocess Labs Using Invert’s Software
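
To see why on-line data dominates these totals, consider a back-of-the-envelope sketch of how sensor count, sampling rate, run length, and run count multiply. The function and every input value below are assumptions chosen purely for illustration; they are not the inputs behind Table 2.

```python
def online_points_per_year(
    sensors: int,
    readings_per_minute: float,
    run_days: float,
    runs_per_year: int,
) -> int:
    """Rough count of on-line data points logged in a year."""
    minutes_per_run = run_days * 24 * 60
    return round(sensors * readings_per_minute * minutes_per_run * runs_per_year)


# Hypothetical lab: 8 sensors per reactor, one reading per minute,
# 10-day runs, 100 runs per year.
estimate = online_points_per_year(
    sensors=8, readings_per_minute=1, run_days=10, runs_per_year=100
)
print(f"{estimate:,}")  # 11,520,000
```

Even these modest assumptions yield more than ten million points per year, and doubling any single factor doubles the total.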

The massive scale of bioprocess data makes simple general-purpose data tools, like Excel, wholly inadequate for the task of managing it. In a spreadsheet, users can review one run easily. Users might even be able to manually manage and review up to 10 runs in a single file. But once users need to start looking for patterns across tens or hundreds of runs, or compare recent data to data several months old, spreadsheets become prohibitively time-consuming and ineffective.
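
By contrast, once run data sits in a consistent structure, a cross-run comparison becomes a few lines of code whether there are three runs or three hundred. The sketch below assumes a hypothetical dictionary of titer time series keyed by run ID; the run names and values are invented for illustration.

```python
from statistics import mean

# Hypothetical titer series per run: (elapsed_hours, titer_g_per_L)
runs = {
    "run-001": [(0, 0.1), (24, 1.2), (48, 2.9)],
    "run-002": [(0, 0.1), (24, 1.5), (48, 3.4)],
    "run-003": [(0, 0.2), (24, 1.0), (48, 2.2)],
}


def final_titer(series):
    """Return the titer value at the latest time point of a run."""
    return max(series, key=lambda point: point[0])[1]


# One summary value per run, then compare across all runs at once.
summary = {run_id: final_titer(series) for run_id, series in runs.items()}
best_run = max(summary, key=summary.get)
average = mean(summary.values())

print(best_run, f"{average:.2f}")  # run-002 2.83
```

The same loop works unchanged as more runs are added, which is exactly the step that becomes painful when each run lives in its own spreadsheet tab or file.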


Having many soft sensor, bioreactor, and instrument sources also means that biomanufacturers must contend with the fact that each (or at least most) comes with its own software. Once the hardware collects the data, it lives within the software specific to that hardware.

This makes for poor interoperability between digital systems, forcing bioprocess scientists and engineers to cope in one of two ways. One, they can rig up custom integrations to make these software packages communicate, but this is easier said than done and leads to expensive and imperfect solutions. Two, they can live without software integration and manually extract the data instead. However, this consumes a lot of time and limits how much a user can accomplish. Put another way, time spent on data extraction and management is time not spent on run design and advanced analysis. This choice also means that biomanufacturers may be unable to take full advantage of the generated data simply because their workflow takes too long to do routinely.
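
One small piece of that manual work is lining up data exported from two different systems. As a sketch, assume an on-line sensor export and an off-line assay export that share only an elapsed-time axis; the function below pairs each off-line sample with the most recent on-line reading at or before it. The function name, data layout, and values are hypothetical.

```python
from bisect import bisect_right


def align_offline_to_online(online, offline):
    """Pair each off-line sample with the latest on-line reading at or before it.

    online:  list of (elapsed_hours, value), sorted by time
    offline: list of (elapsed_hours, value)
    Returns a list of (elapsed_hours, offline_value, online_value_or_None).
    """
    online_times = [t for t, _ in online]
    aligned = []
    for sample_time, sample_value in offline:
        # Index of the last on-line reading taken at or before the sample.
        i = bisect_right(online_times, sample_time) - 1
        online_value = online[i][1] if i >= 0 else None
        aligned.append((sample_time, sample_value, online_value))
    return aligned


# Hypothetical exports from two separate software packages.
ph_online = [(0.0, 7.00), (0.5, 6.98), (1.0, 6.95), (1.5, 6.97)]
glucose_offline = [(0.6, 4.8), (1.5, 3.9)]

print(align_offline_to_online(ph_online, glucose_offline))
# [(0.6, 4.8, 6.98), (1.5, 3.9, 6.97)]
```

In practice the harder parts tend to come before this step: reconciling clocks, units, and naming across vendors so that the two exports can be compared at all.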


In addition to many individual sources and software, data silos also form because a company has multiple teams, production lines, and manufacturing sites, each producing their own data. Without seamless connectivity between disparate groups, data visibility remains poor, even though one group might generate information beneficial to other teams. Even if this information eventually trickles to them, it might be too late to prevent a costly mistake or less optimized batch.

The data silo problem becomes even more pronounced when teams need to hire a contract manufacturing partner (a CDMO or CMO) and perform tech transfer. Compiling this data from many groups and sites is an organizationally demanding task, especially when assembled into giant spreadsheets. Even if the collected information makes sense to internal teams, there’s a good chance it remains unclear for external teams that lack significant context.


Companies must also securely manage bioprocess data in a compliant fashion. Bioprocess data is precious, and in many ways, it’s the lifeblood of the entire operation. So, companies must store that data such that data loss or leaks cannot happen. However, this is not easy to do, given the diversity and scale of the data they generate.  

Going further, regulatory bodies expect biomanufacturers to maintain data records related to their products. This is especially true for pharmaceutical or medical device products, where the regulatory environment is strict. The less traceable or secure the data trail, the more likely biomanufacturers will draw regulatory ire during audits, particularly if the issue relates to product or control failure.
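
To give a flavor of what a traceable data trail can mean in practice, the sketch below shows one generic technique: an append-only log in which each entry stores a hash of the previous entry, so that later edits to earlier records become detectable. This is an illustrative assumption on our part, not a compliance solution and not a description of any specific product or regulatory requirement; the function and field names are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(user: str, action: str, prev_hash: str) -> str:
    """SHA-256 over the entry's content, including the previous entry's hash."""
    body = {"user": user, "action": action, "prev_hash": prev_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: list, user: str, action: str) -> None:
    """Append a new entry chained to the hash of the last one."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append({
        "user": user,
        "action": action,
        "prev_hash": prev_hash,
        "hash": _digest(user, action, prev_hash),
    })


def verify(log: list) -> bool:
    """Return True only if every entry is intact and correctly chained."""
    prev_hash = GENESIS_HASH
    for entry in log:
        expected = _digest(entry["user"], entry["action"], entry["prev_hash"])
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, "analyst-1", "uploaded HPLC results for run-001")
append_entry(audit_log, "engineer-2", "changed pH setpoint to 6.9")
print(verify(audit_log))  # True

audit_log[0]["action"] = "uploaded HPLC results for run-999"  # simulated tampering
print(verify(audit_log))  # False
```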


Although many successful bioproducts are on the market today, modern biomanufacturing remains a nascent industry. This means biomanufacturing workflows, products, technologies, data, and systems will keep evolving over the coming decades. As a recent example, bioreactors continue to trend towards becoming smaller, more productive, and higher throughput. As another, perfusion and continuous biomanufacturing are becoming increasingly popular, as are new cellular and cell-free systems. 

With so much active change, it’s difficult to apply standardized workflows for managing bioprocess data despite demands for them. There is no one-size-fits-all approach. This challenge is made worse by the fact that there hasn’t been much advancement in the space. Innovators remain primarily focused on creating more data to accelerate research and development efforts, largely overlooking equivalent co-development of bioprocess data management and analysis technologies. The research and biomanufacturing hardware advancements have largely left bioprocess data management in the dust.

The Growing Demand For a Bioprocess Data Solution

Though still nascent, biomanufacturing offers tremendous power to make novel, superior, and more sustainable products. Over time, a bevy of new bioproducts will need to reach commercial production scales to enter the market, and they will all need advanced bioprocess data solutions to make their vision a reality. 

In the near future, the biopharma community expects new modalities (like bispecifics, antibody-drug conjugates, RNA vaccines, and cell & gene therapies) to take deeper root. Similarly, a biosimilar boom is on the horizon as the first blockbuster biologic drug patents expire (including for Humira and Keytruda). In addition, more and more companies are looking for greener ingredients in their products, and many are turning to biology. As a final example, the massive momentum in fermentation and food tech will only continue as innovators aim to feed 10 billion people by 2050. Accomplishing these feats will no doubt depend on efficient bioprocess data management.

Given the demand, you may wonder why bioprocess data solutions are not more prominently discussed. In a follow-up to this piece, we will explore why bioprocess data solutions have been overlooked…until now! 

Otherwise, Invert is always happy to talk about bioprocess data management. Reach out today to chat about your bioprocess plans and see a demo of our bioprocess software.