We love DataBolt® so much that we use it to create & deliver our data products.
DIH works with a wide variety of data sources, file formats and end-user requirements. We need tools that enable us to quickly & easily pull in raw data, process it as needed (e.g. validate & clean raw data, munge data together, perform final quality checks) and deliver finished data files to end-users in various formats (e.g. CSV, XML, Parquet) and via multiple methods (e.g. bulk file download, API). We use DataBolt® to complete all of these tasks.
DataBolt® accelerates your data science. The majority of time in data science is often spent on tedious tasks unrelated to the actual analysis. DataBolt® simplifies those tasks so you can quickly & easily get your data ready for analysis.
Manage Workflows — With DataBolt® you can quickly build complex data science workflows:
Build workflows with task dependencies and parameters
Check task dependencies and their execution status
Intelligently execute tasks including dependencies
Intelligently continue workflows after changed/failed tasks
SQL storage integration
Dask and PySpark integration
Automatically detect data changes
Advanced machine learning features
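The workflow features above can be illustrated with a minimal, dependency-aware task runner. This is a conceptual sketch in plain Python, not the DataBolt® API; all class and function names here are hypothetical:

```python
# Conceptual sketch of dependency-aware task execution (not the DataBolt® API):
# each task declares its upstream dependencies, and the runner walks
# dependencies first and executes only tasks whose output is missing.

class Task:
    """A unit of work with declared dependencies and a completion check."""
    def __init__(self, name, requires=()):
        self.name = name
        self.requires = list(requires)
        self.output = None  # None means "not yet run"

    def complete(self):
        return self.output is not None

    def run(self):
        # A real task would load inputs, transform data, and persist results.
        self.output = f"result-of-{self.name}"

def run_task(task, log):
    """Run a task, recursively running any incomplete dependencies first."""
    for dep in task.requires:
        if not dep.complete():
            run_task(dep, log)
    if not task.complete():
        task.run()
        log.append(task.name)

# A small workflow: clean -> merge -> report
clean = Task("clean")
merge = Task("merge", requires=[clean])
report = Task("report", requires=[merge])

executed = []
run_task(report, executed)
# Re-running is cheap: completed tasks are skipped, so nothing is appended.
run_task(report, executed)
```

This captures the pattern in the bullets: dependencies execute before dependents, execution status is checked per task, and re-runs only redo incomplete work.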
Manage Data Sets — DataBolt® is a turnkey solution to host data files, documentation and metadata so others can quickly use your data:
Quickly create public and private remote file storage
Push/pull data to/from remote file storage
Secure your data with best practice security
Centrally manage data files across multiple projects
Self-hosted remote storage
On-premises deployment
Data encryption
Data versioning
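A push/pull workflow against remote storage can be sketched as follows, using only the standard library. The function names and the local directory standing in for remote storage are hypothetical, not the DataBolt® API; content hashes let both directions skip files that are already up to date:

```python
# Illustrative push/pull sketch (not the DataBolt® API): a local directory
# stands in for remote storage, and SHA-256 content hashes detect which
# files actually changed so unchanged files are skipped.

import hashlib
import shutil
import tempfile
from pathlib import Path

def file_hash(path):
    """SHA-256 of a file's contents, used to detect changed data."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def sync(src_dir, dst_dir):
    """Copy files from src to dst when missing or changed; return names copied."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in sorted(Path(src_dir).glob("*")):
        target = dst / f.name
        if not target.exists() or file_hash(target) != file_hash(f):
            shutil.copy2(f, target)
            copied.append(f.name)
    return copied

# push = sync(local, remote); pull = sync(remote, local)
work = Path(tempfile.mkdtemp())
local, remote, other = work / "local", work / "remote", work / "other"
local.mkdir()
(local / "data.csv").write_text("a,b\n1,2\n")

first_push = sync(local, remote)   # new file is uploaded
second_push = sync(local, remote)  # unchanged file is skipped
pulled = sync(remote, other)       # another project pulls it down
```

The same hash comparison is a natural hook for data versioning: storing the manifest of hashes per push yields a cheap change history.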
Ingest Data — With DataBolt® you can quickly & reliably ingest raw CSV, TXT and Excel® files to SQL, pandas, parquet and more:
Check and fix data schema changes
Fast bulk writes from pandas to Postgres and MySQL
Ingest messy Excel® files
Out-of-core support
MS SQL integration
Advanced database features
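Checking and fixing schema changes can be sketched like this, again as a hedged illustration rather than the DataBolt® API: take the union of column names across files, then re-emit every row against that combined schema so files with added or missing columns line up:

```python
# Hedged sketch of schema-drift handling for CSV ingestion (not the
# DataBolt® API): compute the union of columns across files, then align
# each row to that schema, filling columns a given file lacks.

import csv
import io

def combined_schema(files):
    """Union of header columns across CSV texts, in first-seen order."""
    columns = []
    for text in files:
        header = next(csv.reader(io.StringIO(text)))
        for col in header:
            if col not in columns:
                columns.append(col)
    return columns

def align(file_text, columns, missing=""):
    """Yield each row as a dict conforming to the combined schema."""
    for row in csv.DictReader(io.StringIO(file_text)):
        yield {col: row.get(col, missing) for col in columns}

# Two monthly extracts where a column was added mid-stream:
jan = "date,sales\n2023-01-31,100\n"
feb = "date,sales,region\n2023-02-28,120,EU\n"

schema = combined_schema([jan, feb])
rows = list(align(jan, schema)) + list(align(feb, schema))
```

Aligned rows like these can then be written to any destination with a single fixed schema, whether SQL, pandas or Parquet.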
Join Data — Easily join different data sets without writing custom code using fuzzy matches:
Easily find join columns across data frames
Automatic content-based exact joins
PreJoin quality diagnostics
Descriptive stats for id/string joins
Join more than two data frames
Automatic content-based similarity joins
Advanced join quality checks
Fast approximations for big data
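The idea behind a similarity join can be sketched with the standard library's difflib; this is a conceptual illustration, not the DataBolt® API, and the `fuzzy_join` helper and sample ids are hypothetical. For each key on the left, find the closest key on the right, rejecting matches below a cutoff:

```python
# Illustrative similarity join (not the DataBolt® API): match each left key
# to its closest right key by difflib's similarity ratio, with a cutoff
# so clearly unrelated keys stay unmatched.

import difflib

def fuzzy_join(left, right, cutoff=0.6):
    """Map each left key to its best-matching right key, or None."""
    matches = {}
    for key in left:
        candidates = difflib.get_close_matches(key, right, n=1, cutoff=cutoff)
        matches[key] = candidates[0] if candidates else None
    return matches

# Identifiers that differ only in formatting between two data vendors:
left_ids = ["AAPL US", "MSFT US", "XYZ US"]
right_ids = ["AAPL UW", "MSFT UW"]

joined = fuzzy_join(left_ids, right_ids)
```

Unmatched keys (`None` values) are exactly what pre-join diagnostics surface before any custom join code is written.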
DataBolt® Pipe — Make data delivery and on-boarding more efficient:
Turnkey Infrastructure — Manage data files, documentation and metadata using flexible and secure infrastructure
Simple Web GUI — Non-technical users from business teams can access and manage datasets without involving engineering teams
Faster On-boarding — Data consumers benefit from unified delivery, richer metadata and fast access via the free GUI, API and Python libraries
Better Documentation — Richer metadata and documentation make it easier and faster to ingest, analyze and understand data
DataBolt® Difference — How DataBolt® is different from other tools and services:
Open Architecture — Designed to promote data exchange and reduce friction in data pipelines.
Flexible Access — Use Python libraries, the REST API or the GUI, in the cloud or on premises.
Immediate Use — No lengthy sales process, technical setup or deployment.
Community Enabled — Contributions from the community are welcome and encouraged.