
Data Processing Nodes

A comprehensive data manipulation toolkit with six specialized nodes for filtering, sorting, merging, and transforming tabular data in scientific workflows.

Node Reference

Detailed documentation for each data processing node available in Bioshift.

Select Columns

Select specific columns from DataFrames

Type: select_columns
Category: Column Operations

Key Features

  • Single or multiple column selection
  • Column name validation
  • Column reordering
  • Pattern-based selection
  • Preview functionality

Input Ports

  • dataframe (data): Input DataFrame
  • columns (data): List of column names to select

Output Ports

  • selected_data (data): DataFrame with selected columns
  • column_count (number): Number of selected columns
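
For intuition, here is a minimal pandas sketch of what this node does; the column names and sample data are illustrative, and the node's actual implementation may differ:

    import pandas as pd

    df = pd.DataFrame({"gene": ["BRCA1", "TP53"], "expr": [2.4, 1.1], "batch": [1, 2]})
    columns = ["expr", "gene"]  # the "columns" input port; its order sets the new column order

    # Column name validation: fail early on unknown names
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"Unknown columns: {missing}")

    selected_data = df[columns]                # "selected_data" output port
    column_count = len(selected_data.columns)  # "column_count" output port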

Filter Rows

Filter DataFrame rows based on conditions

Type: filter_rows
Category: Row Operations

Key Features

  • Expression-based filtering
  • Multiple condition support
  • Interactive condition builder
  • Preview of filtered results
  • Filter validation

Input Ports

  • dataframe (data): Input DataFrame
  • condition (string): Filter condition expression

Output Ports

  • filtered_data (data): Filtered DataFrame
  • row_count (number): Number of filtered rows
  • filter_summary (data): Filtering operation summary
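
A sketch of expression-based filtering using pandas query; the condition string and column names are illustrative assumptions:

    import pandas as pd

    df = pd.DataFrame({"expr": [2.4, -0.3, 1.1], "pvalue": [0.01, 0.20, 0.04]})
    condition = "expr > 0 and pvalue < 0.05"  # the "condition" input port

    filtered_data = df.query(condition)       # "filtered_data" output port
    row_count = len(filtered_data)            # "row_count" output port
    filter_summary = {                        # "filter_summary" output port
        "condition": condition,
        "rows_before": len(df),
        "rows_after": row_count,
    }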

Slice Rows

Extract rows by position or range

Type: slice_rows
Category: Row Operations

Key Features

  • Index-based slicing
  • Range specification
  • Step parameter support
  • Negative indexing
  • Boundary validation

Input Ports

  • dataframe (data): Input DataFrame
  • start_index (number): Starting row index
  • end_index (number): Ending row index

Output Ports

  • sliced_data (data): Sliced DataFrame
  • row_count (number): Number of sliced rows
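
A minimal sketch, assuming Python slice semantics (end index exclusive, negative indices counted from the end); the step value shown is illustrative:

    import pandas as pd

    df = pd.DataFrame({"value": range(10)})
    start_index, end_index, step = 2, 8, 2

    # Boundary validation before slicing
    if not -len(df) <= start_index < len(df):
        raise IndexError("start_index out of range")

    sliced_data = df.iloc[start_index:end_index:step]  # rows 2, 4, 6
    row_count = len(sliced_data)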

Drop Duplicates

Remove duplicate rows from DataFrames

Type: drop_duplicates
Category: Data Cleaning

Key Features

  • Column-specific duplicate detection
  • Keep first/last occurrence options
  • Duplicate count reporting
  • Preserve original index option
  • Performance optimized

Input Ports

  • dataframe (data): Input DataFrame
  • columns (data): Columns to consider for duplicates

Output Ports

  • cleaned_data (data): DataFrame without duplicates
  • duplicates_count (number): Number of duplicates removed
  • duplicate_rows (data): Removed duplicate rows
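
A sketch using pandas duplicated masks, which yields all three outputs in one pass; the subset columns are illustrative:

    import pandas as pd

    df = pd.DataFrame({"id": [1, 1, 2], "value": ["a", "a", "b"]})
    subset = ["id", "value"]  # the "columns" input port; None would mean "all columns"

    dup_mask = df.duplicated(subset=subset, keep="first")  # keep="last" is the other option
    cleaned_data = df[~dup_mask]            # "cleaned_data": first occurrences kept
    duplicate_rows = df[dup_mask]           # "duplicate_rows": the rows that were removed
    duplicates_count = int(dup_mask.sum())  # "duplicates_count"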

Sort Rows

Sort DataFrame rows by one or more columns

Type: sort_rows
Category: Data Organization

Key Features

  • Multiple column sorting
  • Ascending/descending order
  • Stable sort option
  • Custom sort key functions
  • Null value handling

Input Ports

  • dataframe (data): Input DataFrame
  • sort_columns (data): Columns to sort by
  • ascending (data): Sort order for each column

Output Ports

  • sorted_data (data): Sorted DataFrame
  • sort_info (data): Sorting operation details
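
A sketch with pandas sort_values; kind="mergesort" is pandas' documented stable sort and na_position controls null handling. Column names and data are illustrative:

    import pandas as pd

    df = pd.DataFrame({"gene": ["b", "a", "c"], "expr": [1.0, None, 2.0]})
    sort_columns = ["expr", "gene"]  # "sort_columns" input port
    ascending = [False, True]        # "ascending" input port, one flag per column

    sorted_data = df.sort_values(
        by=sort_columns,
        ascending=ascending,
        kind="mergesort",    # stable sort
        na_position="last",  # nulls sort to the end
    )
    sort_info = {"by": sort_columns, "ascending": ascending}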

Merge DataFrames

Combine multiple DataFrames using various join operations

Type: merge_dataframes
Category: Data Combination

Key Features

  • Multiple join types (inner, outer, left, right)
  • Column-based merging
  • Index-based merging
  • Suffix handling for duplicate columns
  • Merge validation

Input Ports

  • left_dataframe (data): Left DataFrame
  • right_dataframe (data): Right DataFrame
  • left_on (data): Left DataFrame join columns
  • right_on (data): Right DataFrame join columns

Output Ports

  • merged_data (data): Merged DataFrame
  • merge_info (data): Merge operation statistics
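
A sketch with pd.merge; the suffixes and validate arguments mirror the suffix-handling and merge-validation features. Data and key names are illustrative:

    import pandas as pd

    left = pd.DataFrame({"gene_id": [1, 2], "expr": [2.4, 1.1]})
    right = pd.DataFrame({"id": [1, 3], "symbol": ["BRCA1", "MYC"]})

    merged_data = pd.merge(
        left, right,
        left_on="gene_id",             # "left_on" input port
        right_on="id",                 # "right_on" input port
        how="inner",                   # inner/outer/left/right
        suffixes=("_left", "_right"),  # applied to duplicate column names
        validate="one_to_one",         # merge validation: raises if keys are not unique
    )
    merge_info = {"how": "inner", "rows": len(merged_data)}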

Workflow Examples

Common data processing workflows you can build with these nodes.

Data Cleaning Pipeline

Complete workflow for cleaning and preparing datasets; a pandas sketch of the equivalent calls follows the steps.

  1. Load raw dataset using CSV Reader
  2. Remove duplicate rows with Drop Duplicates
  3. Filter invalid entries with Filter Rows
  4. Select relevant columns with Select Columns
  5. Sort data by key column with Sort Rows
  6. Save cleaned data with Save DataFrame
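
A sketch of the pipeline; the file paths and the "sample_id" and "value" columns are assumptions for illustration:

    import pandas as pd

    df = pd.read_csv("raw_data.csv")            # 1. CSV Reader
    df = df.drop_duplicates()                   # 2. Drop Duplicates
    df = df[df["value"].notna()]                # 3. Filter Rows (here: drop missing values)
    df = df[["sample_id", "value"]]             # 4. Select Columns
    df = df.sort_values("sample_id")            # 5. Sort Rows
    df.to_csv("cleaned_data.csv", index=False)  # 6. Save DataFrame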

Data Integration Workflow

Combine multiple datasets with join operations; see the sketch after the steps.

  1. Load primary dataset
  2. Load secondary dataset
  3. Identify common join columns
  4. Perform inner join with Merge DataFrames
  5. Handle any remaining duplicates
  6. Validate merged data integrity
  7. Export final integrated dataset
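
An equivalent sketch; the paths are illustrative, and the integrity check shown is deliberately minimal:

    import pandas as pd

    primary = pd.read_csv("primary.csv")      # 1. primary dataset
    secondary = pd.read_csv("secondary.csv")  # 2. secondary dataset

    join_cols = [c for c in primary.columns if c in secondary.columns]  # 3. common columns
    merged = pd.merge(primary, secondary, on=join_cols, how="inner")    # 4. inner join
    merged = merged.drop_duplicates()         # 5. remaining duplicates

    if merged.empty:                          # 6. basic integrity check
        raise ValueError("Join produced no rows; check the join columns")
    merged.to_csv("integrated.csv", index=False)  # 7. export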

Data Sampling Pipeline

Extract representative samples from large datasets; a sketch follows the steps.

  1. Load large dataset
  2. Apply filtering criteria
  3. Sort by relevant metrics
  4. Use Slice Rows with a step value to draw a systematic sample (Slice Rows is positional, so the sample is systematic rather than random)
  5. Validate sample representativeness
  6. Export sample for analysis
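
A sketch of systematic sampling via positional slicing; the "score" column, step size, and paths are assumptions:

    import pandas as pd

    df = pd.read_csv("large_dataset.csv")          # 1. load
    df = df[df["score"] > 0]                       # 2. filter
    df = df.sort_values("score", ascending=False)  # 3. sort by the metric of interest
    sample = df.iloc[::100]                        # 4. Slice Rows: every 100th row
    print(sample["score"].describe())              # 5. sanity-check the sample distribution
    sample.to_csv("sample.csv", index=False)       # 6. export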

Data Processing Pipeline

Typical workflow for data preprocessing in scientific applications.

Step 1: Data Loading

  • CSV Reader
  • Excel Reader
  • File Input

Step 2: Data Cleaning

  • Drop Duplicates
  • Filter Rows
  • Select Columns

Step 3: Data Transformation

  • Sort Rows
  • Slice Rows
  • Merge DataFrames

Step 4: Data Export

  • Save DataFrame
  • Table View
  • Text View