Data Science

1.1. Data Science

1.1.1. Why Data Science?

There are plenty of compelling reasons to study data science. Some of them are:

High job market demand: Data scientists are widely hired across the sectors of tech, healthcare, finance, retail, government, etc.
Transferable skill-sets: Once you learn stats, coding, and problem-solving, you can apply them in many other roles.
Improved decision-making: Data science focuses on obtaining insights from data, which greatly reduces decision pitfalls in organizations.
Foundation for AI/ML: Data science is the gateway to advanced topics in machine learning and AI, such as deep learning, advanced natural language processing (NLP), large language models (LLMs), and generative AI.

1.1.2. What is Data Science?

According to the U.S. Census Bureau [Bureau, 2022], data science is “a field of study that uses scientific methods, processes, and systems to extract knowledge and insights from data.” As the Data Science Venn Diagram proposed by Drew Conway (Fig. 1.2) suggests, data science is interdisciplinary by nature: practitioners draw on training in statistics, computing, and domain expertise.

If we put the related data work fields together, we can see that the data science (broadly defined) fields overlap with each other:

From Fig. 1.3, we can try to define data science by looking at how it relates to other fields. The figure can be read as having three layers: the business layer (dark green), the data layer (light green), and the machine learning/AI layer (purple). We may observe that:

Data science covers most of the fields of data analytics and machine learning.
Data science also has a huge overlap with business analytics.
Data science can be considered a subset of data analytics, just like machine learning.
Business intelligence can almost be considered as a subset of data science with heavy business applications.
AI may be considered as an extension of machine learning (but still has a certain overlap with data science).

A Historical Note

From the perspective of decision-making, particularly in the business context, researchers and practitioners have been leveraging various data tools to enhance the effectiveness and competitiveness of organizations. In chronological order, the data fields have emerged as follows:

Table: Evolution of Data Fields

Era/Period	Key Technologies & Developments
1970s	Relational databases (SQL), Decision Support Systems (DSS)
1980s	OLTP (Online Transaction Processing) systems, data modeling (ER), Executive Information Systems (EIS)
1990s	Data warehousing (ETL, or “Extract → Transform → Load”), OLAP (Online Analytical Processing), Business Intelligence (BI: dashboards), Data Mining, KDD (Knowledge Discovery in Databases)
2000s	BI (dashboards & KPIs), Business Analytics
2000s–2010s	“Data Analytics” for descriptive/diagnostic analysis
Late 2000s–2010s	Big Data (Hadoop/NoSQL/Spark to meet the volume–velocity–variety features of data)
2010s	Data Science (end-to-end: data wrangling → modeling → communication/impact)
2010s–2020s	Modern ML, NN (neural networks), & Deep Learning
2020s	Responsible AI, GenAI

As observed, over time, the data-backed decision-making fields have evolved from descriptive (BI/OLAP/EDA/visualization) to predictive and prescriptive (LLM ⊂ DL ⊂ NN ⊂ ML ⊂ AI).

The Data Science Process

The CRISP-DM model

As a general process model for conducting data science, the Cross-Industry Standard Process for Data Mining, commonly referred to as CRISP-DM, is widely adopted as a de facto standard for describing common approaches to data mining and data science projects. There are 6 phases in this process model:

Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment

Fig. 1.4 CRISP-DM Process Model

In addition to the process model, CRISP-DM also has a methodology that contains specific, generic tasks in each phase, as shown in the table below.

Table: CRISP-DM tasks in each phase

I. Business Understanding	II. Data Understanding	III. Data Preparation	IV. Modeling	V. Evaluation	VI. Deployment
Determine business objectives	Collect initial data	Select data	Select modeling techniques	Evaluate results	Plan deployment
Assess situation	Describe data	Clean data	Generate test design	Review process	Plan monitoring and maintenance
Determine data mining goals	Explore data	Construct data	Build model	Determine next steps	Produce final report
Produce project plan	Verify data quality	Integrate data	Assess model		Review project
		Format data

General Data Science Lifecycle Models

In industry, practitioners often create their own process models based on CRISP-DM. For example, Fig. 1.5 shows a general data science lifecycle model that adds an exploratory data analysis (EDA) step and combines the business understanding and data understanding phases.

Fig. 1.5 General data science lifecycle model
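The lifecycle stages can be sketched in a few lines of code. The following is a minimal illustration in Python using only the standard library. The dataset (advertising spend versus sales), the function names, and the choice of a one-variable least-squares line are all hypothetical, picked for illustration only; they are not prescribed by CRISP-DM or by the lifecycle figure.

```python
import statistics

# Hypothetical raw records: (ad_spend, sales); None marks a missing value.
RAW = [(1.0, 3.1), (2.0, 4.9), (None, 5.0), (3.0, 7.2), (4.0, 8.8), (5.0, 11.1)]


def clean(rows):
    """Data preparation: drop records that contain missing values."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def explore(rows):
    """EDA: summarize each variable with simple descriptive statistics."""
    xs, ys = zip(*rows)
    return {"n": len(rows), "mean_x": statistics.mean(xs), "mean_y": statistics.mean(ys)}


def fit_line(rows):
    """Modeling: ordinary least squares for y = slope * x + intercept."""
    xs, ys = zip(*rows)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


def evaluate(model, rows):
    """Evaluation: mean absolute error of the fitted line on the given rows."""
    slope, intercept = model
    return statistics.mean(abs(y - (slope * x + intercept)) for x, y in rows)


data = clean(RAW)              # data preparation
summary = explore(data)        # exploratory data analysis
model = fit_line(data)         # modeling
error = evaluate(model, data)  # evaluation (on the training rows, for brevity)
print(summary, model, round(error, 3))
```

In a real project the evaluation step would normally use data held out from model fitting, and the deployment phase would follow; both are omitted here to keep the sketch short.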

Data Science Careers

There are plenty of jobs and career opportunities in the general field of data science. From the perspective of the data science workflow/lifecycle (Fig. 1.6), four common data science jobs roughly correspond to different phases of the workflow: data engineers handle data collection and cleaning/cleansing, data analysts handle data cleaning and EDA, machine learning engineers handle model building and deployment, and data scientists work across the whole workflow.

Fig. 1.6 Data science workflow and related job titles

First, as technology advances, AI has become central to modern data work and is now foundational across the data stack. A quick search on an online job site (indeed.com) for the term “data scientists” yields job titles showing that data science and AI are interconnected:

Generative AI/ML Data Scientist
Senior Agentic AI Data Scientist
Manager, Data Science - GenAI Digital Assistant
Senior Advanced AI Software Engineer
Data Scientist Lead - Bank AI/ML

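Further, a search for the related job titles in August 2025 shows that “Data Scientist” yields the most results (5,000+), followed by “Data Engineer” (4,000+), while “artificial intelligence” returns 8,000+ jobs. In the January 2026 search the order differs, with “Data Engineer” returning the most results (139,000+).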

Table: Data Science Jobs

	Data Engineer	Data Analyst	Machine Learning Engineer	Data Scientist
Aug 2025	4,000+	3,000+	1,000+	5,000+
Jan 2026	139,000+	115,000+	111,000+	12,000+

However, when searching with the field/skill-set terms “Data Engineering”, “Data Analysis”, “Machine Learning”, and “Data Science”, plus “Artificial Intelligence”, the term “Data Analysis” returns the most results in both searches; in August 2025 it is followed by “Artificial Intelligence”. This result at least suggests that machine learning and AI have become essential in the data science job market.

Table: Data Science Skills

	Data Engineering	Data Analysis	Machine Learning	Data Science	Artificial Intelligence
Aug 2025	5,000+	9,000+	7,000+	7,000+	8,000+
Jan 2026	158,000+	173,000+	71,000+	144,000+	63,000+
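To make the comparisons above easy to check, the counts from the two tables can be ranked with a few lines of Python. This is only a rough sketch: each “N+” figure is treated as a lower bound and stored as a plain integer, and the variable and function names are hypothetical.

```python
# Lower bounds copied from the two tables above (e.g. "5,000+" is stored as 5000).
JOB_TITLES = {
    "Aug 2025": {"Data Engineer": 4000, "Data Analyst": 3000,
                 "Machine Learning Engineer": 1000, "Data Scientist": 5000},
    "Jan 2026": {"Data Engineer": 139000, "Data Analyst": 115000,
                 "Machine Learning Engineer": 111000, "Data Scientist": 12000},
}
SKILL_TERMS = {
    "Aug 2025": {"Data Engineering": 5000, "Data Analysis": 9000, "Machine Learning": 7000,
                 "Data Science": 7000, "Artificial Intelligence": 8000},
    "Jan 2026": {"Data Engineering": 158000, "Data Analysis": 173000, "Machine Learning": 71000,
                 "Data Science": 144000, "Artificial Intelligence": 63000},
}


def rank(counts):
    """Return the search terms ordered from most to fewest results."""
    return sorted(counts, key=counts.get, reverse=True)


for label, table in (("job titles", JOB_TITLES), ("skill terms", SKILL_TERMS)):
    for snapshot, counts in table.items():
        print(label, snapshot, rank(counts))
```

Because the figures are rounded lower bounds taken from a single job site, the rankings are indicative only; terms with the same bound (such as “Machine Learning” and “Data Science” in August 2025) cannot be ordered from this data.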

Data Science Tools

As an attempt to summarize data science tools in four stages of data science operations (data management, data manipulation, data analysis, and visualization), a researcher posted his data science tools summary on LinkedIn to ask for feedback.

As seen in Fig. 1.7, the figure contains four groups of data science tools, plus a set of other tools:

Data Management: Tools in this category are databases for data collection/acquisition and cleaning/cleansing. Most of the listed tools are SQL DBMSs; the exceptions are MongoDB, Redis, and Neo4j, which are NoSQL databases.
Data Manipulation: Programming languages such as Python and R, plus libraries such as pandas and NumPy, are included.
Data Analysis: scikit-learn is a popular toolkit for machine learning analysis based on NumPy, SciPy, and matplotlib. PyTorch and TensorFlow are popular open-source machine learning frameworks primarily used for building and training deep learning models.
Data Visualization: Matplotlib is a comprehensive library for creating visualizations in Python. Seaborn is a visualization library built on top of matplotlib, offering a high-level interface for creating statistical graphics.
Other Tools: Jupyter Notebook, along with other Jupyter products, is a web-based interactive development environment for creating and sharing documents, code, and visualizations. Jupyter Notebook has evolved into a type of Integrated Development Environment (IDE), supporting more than 40 languages, including Python, R, and Scala.

Note that while there are a large number of data science tools to choose from, beginners may pick one from each category to start learning. Once you master a tool, it is often easy to transfer the concepts and skills to another tool in the same category. For example, SQL DBMSs implement dialects of the same language standard, and almost all programming languages share the same basic constructs.
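As an illustration of how concepts carry over between tools, the sketch below runs one SQL aggregation in SQLite (through Python's built-in sqlite3 module) and then repeats the same grouping logic in plain Python. The jobs table and its rows are hypothetical. The query uses only widely supported constructs (SELECT, SUM, GROUP BY, ORDER BY), so most SQL DBMSs should accept it with little or no change, although dialects do differ in details.

```python
import sqlite3
from collections import defaultdict

# Hypothetical rows: (job title, city, number of openings).
ROWS = [
    ("Data Engineer", "Austin", 12),
    ("Data Analyst", "Austin", 7),
    ("Data Engineer", "Boston", 5),
    ("Data Scientist", "Boston", 9),
]

# Data management tool: load the rows into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (title TEXT, city TEXT, openings INTEGER)")
conn.executemany("INSERT INTO jobs VALUES (?, ?, ?)", ROWS)

# Total openings per title, expressed in SQL.
sql_totals = conn.execute(
    "SELECT title, SUM(openings) FROM jobs GROUP BY title ORDER BY title"
).fetchall()
conn.close()

# The same group-and-sum idea, expressed with basic Python constructs.
totals = defaultdict(int)
for title, _city, openings in ROWS:
    totals[title] += openings
py_totals = sorted(totals.items())

print(sql_totals)
print(py_totals)
```

Both approaches express the same idea (group the records by a key, then aggregate), which is the kind of concept that transfers when moving from one tool to another.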

Resources

https://towardsdatascience.com/
https://365datascience.com/
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (2000). CRISP-DM 1.0: Step-by-step data mining guide. The CRISP-DM Consortium.
https://realpython.com/pytorch-vs-tensorflow/