Data is not always useful, no matter how much of it you have.
There’s no mathematical tool to tell you if your hypothesis is true; you can only see whether it is consistent with the data, and if the data is sparse or unclear, your conclusions are uncertain.
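As a rough illustration of the note above, here is a minimal sketch that checks whether a hypothesised proportion is consistent with observed counts, using an approximate 95% Wilson score interval. The coin-flip framing, the function names `wilson_interval` and `is_consistent`, and the sample counts are all assumptions made for this example; they are not taken from any of the resources listed here.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    ) / denom
    return centre - half_width, centre + half_width


def is_consistent(hypothesised_p: float, successes: int, trials: int) -> bool:
    """True if the hypothesised proportion lies inside the interval.

    This only says the data does not contradict the hypothesis;
    it does not show that the hypothesis is true.
    """
    low, high = wilson_interval(successes, trials)
    return low <= hypothesised_p <= high


# Hypothetical data: the same observed rate (70% heads) at two sample sizes.
small = wilson_interval(7, 10)       # sparse data, wide interval
large = wilson_interval(700, 1000)   # more data, narrow interval

print("10 flips:  ", small, "fair coin consistent:", is_consistent(0.5, 7, 10))
print("1000 flips:", large, "fair coin consistent:", is_consistent(0.5, 700, 1000))
```

With only 10 flips the interval is wide enough that a fair coin cannot be ruled out; with 1000 flips at the same rate it can. In neither case does the calculation prove any hypothesis true.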
Cookiecutter Data Science - Logical, reasonably standardized, but flexible project structure for doing and sharing data science work. (Code)
Pachyderm - Reproducible Data Science at Scale. (Web) (Pachyderm Hub)
Virgilio - Mentor for Data Science E-Learning.
Awesome Data Science with Python - Curated list of Python resources for data science.
nteract - Interactive computing suite for you.
Pandas - Powerful Python data analysis toolkit. (Ongoing list of pandas quirks)
Datasette - Open source multi-tool for exploring and publishing data. (Web) (datasette-graphql) (Running Datasette on DigitalOcean App Platform) (Interesting ideas in Datasette) (HN)
Weld - High-performance runtime for data analytics applications.
Vaex - Out-of-core DataFrames for Python. Visualize and explore big tabular data at a billion rows per second.
Ibis - Python data analysis framework for Hadoop and SQL engines.
Kyso - Data analytics knowledge hub.
Feather - Fast, interoperable binary data frame storage for Python, R, and more powered by Apache Arrow.
ROOT system - Provides a set of OO frameworks with all the functionality needed to handle and analyze large amounts of data in a very efficient way.
Prefect - New workflow management system, designed for modern infrastructure and powered by the open-source Prefect Core workflow engine.
Monument - High-productivity toolkit for predictions. AutoML for time series on any desktop, laptop or server.
Numba - NumPy aware dynamic Python compiler using LLVM. (5 minute guide) (Web)
Apache Airflow - Platform to programmatically author, schedule, and monitor workflows. (Tutorial) (Kedro-Airflow - Makes it easy to deploy Kedro projects to Airflow.) (Airflow 2.0) (HN) (Introduction to Apache Airflow (2021))
Apache Zeppelin - Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Apache NiFi - Easy to use, powerful, and reliable system to process and distribute data.
Koalas - Pandas API on Apache Spark.
Easy Data Transform - Transform Your Data Without Programming. (HN)
Prophet - Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.
Dagster - Python library for building data applications: ETL, ML, Data Pipelines, and more. (Dagster: The Data Orchestrator) (Lobsters)
CuPy - NumPy-like API accelerated with CUDA.
SaturnCloud - Manage Data Science applications so Data Scientists don't have to do DevOps.
Falcon - Interactive Visual Analysis for Big Data.
Google Cloud DataLab - Interactive tools and developer experiences for Big Data on Google Cloud Platform.
Great Expectations - Leading tool for validating, documenting, and profiling your data to maintain quality and improve communication between teams. (Code)
Common Workflow Language - Open standard for describing analysis workflows and tools. (HN)
Apache Kudu - Completes Hadoop's storage layer to enable fast analytics on fast data.
Turing Way - Lightly opinionated guide to reproducible data science. (Code)
Sisu - Fastest Diagnostic Platform for Structured Data. (Introducing Sisu)
Deepnote - Data science notebook for teams. (Docs) (Awesome Deepnote) (HN)
Learn Python for Data Science - Collection of Jupyter Notebooks designed to learn Python for Data Science. (HN)
Jigsaw Labs - Learn Data Science part-time.
Data Science Ontology - Knowledge base about data science.
Data Engineering Project - Implementation of a data pipeline that consumes the latest news from RSS feeds and makes it available to users via a handy API.
Hex Technologies - Turn your notebooks into collaborative, sharable data apps and stories. No more loose CSVs, chart screenshots, or stale decks.
Amundsen by Lyft - Open source data discovery and metadata engine.
Streamlit Sharing - Platform for deploying, managing, and sharing your apps. (HN)
PandasGUI - GUI for analyzing Pandas DataFrames.
Holistics - Data Modeling & Self-Service BI Platform.
Neptune.ai - Experiment tracking tool for you and your team. (GitHub)
Neptune Python Client - Integrate your Python scripts with Neptune.
Synerise - Powerful ecosystem driven by Artificial Intelligence with real-time data orchestration created to drive business growth.
Dataquest - Learn R, Python and SQL for Data Science.
Carpentries - Teach foundational coding and data science skills to researchers worldwide.
Data Engineering Book - Accumulated knowledge and experience in the field of Data Engineering.
Data Science Lifecycle Process - Set of prescriptive steps and best practices to enable data science teams to consistently deliver value.
Data Science Lifecycle Base Repo - Template repository for data science projects using the Data Science Life Cycle Process.
Scalable Data Science - Course sets in big data using Apache Spark on Databricks, and their mathematical, statistical and computational foundations using SageMath. (Code)
Data Carpentry - Develops and teaches workshops on the fundamental data skills needed to conduct research.
Data Together - Exploring Community-Driven Data Stewardship. (GitHub)
Data Together Research - Research for tackling the general problem of data resilience & interactivity in all its forms.
Apache Superset - Modern data exploration and visualization platform. (Code)
Elements of Data Science - Introduction to data science in Python, for people with no programming experience. (Code)
Data Science on AWS - AI and Machine Learning with Kubeflow, Amazon EKS, and SageMaker. (Code)