GOAL
Find the relevant data that helps you answer the questions that define the project objectives.
Identify data sources that contain known examples of answers to your sharp questions. Look for the following data:
- Data that's relevant to the question. Do you have measures of the target and features that are related to the target?
- Data that's an accurate measure of your model target and the features of interest.
For example, you might find that the existing systems need to collect and log additional kinds of data to address the problem and achieve the project goals. In this situation, you might want to look for external data sources or update your systems to collect new data.
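One way to check whether a candidate source really contains measures of the target and the features is to look at how often each column is filled in. The following is only a minimal sketch: the `coverage` helper, the column names (`yield`, `temperature`, `pressure`), and the sample rows are made up for illustration and do not come from any real system.

```python
from typing import Dict, Iterable, List, Optional

# A row maps a column name to a value; None (or an absent key) means missing.
Row = Dict[str, Optional[float]]


def coverage(rows: List[Row], columns: Iterable[str]) -> Dict[str, float]:
    """Return, per column, the fraction of rows with a non-missing value."""
    total = len(rows)
    result = {}
    for column in columns:
        present = sum(1 for row in rows if row.get(column) is not None)
        result[column] = present / total if total else 0.0
    return result


# Hypothetical sample: "yield" is the target, the other columns are features.
rows = [
    {"yield": 0.91, "temperature": 180.0, "pressure": None},
    {"yield": 0.88, "temperature": 175.0},
    {"yield": None, "temperature": 182.0, "pressure": 2.1},
    {"yield": 0.93, "temperature": 178.0, "pressure": 2.0},
]

report = coverage(rows, ["yield", "temperature", "pressure"])
print(report)  # → {'yield': 0.75, 'temperature': 1.0, 'pressure': 0.5}
```

A low coverage value for the target or for a key feature suggests that the existing systems may need to log more data, or that an external source is worth considering.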
Find the data owners who can help you develop the project infrastructure.
Does the available data belong to the Braskem domain or to a third party?
Braskem Domain:
- Which system holds the data (SAP, Excel files, Pins)?
- Who is the owner of this data?
- Who has knowledge of this data? Who can be consulted in case of doubt?
Third Party:
- What contract do we have with the third party?
- What is the (contractual) latency of this data?
- What restrictions does the contract impose?
These are relevant questions whose answers you should know in order to minimize problems related to data ingestion.
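The ownership questions above can be kept as a small record per data source, so that unanswered items are easy to spot before ingestion work starts. This is a sketch under assumptions: the `OwnershipRecord` class, its field names, and the example source `sales_orders` are hypothetical and not part of any existing tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Domain(Enum):
    INTERNAL = "internal"        # data owned inside the company
    THIRD_PARTY = "third_party"  # data supplied under a contract


@dataclass
class OwnershipRecord:
    """Hypothetical record of the ownership answers for one data source."""
    source_name: str
    domain: Domain
    # Questions for internal data
    system: Optional[str] = None        # e.g. "SAP" or "Excel files"
    owner: Optional[str] = None         # who owns the data
    expert: Optional[str] = None        # who can be consulted in case of doubt
    # Questions for third-party data
    contract: Optional[str] = None      # contract reference
    latency: Optional[str] = None       # contractual latency
    restrictions: Optional[str] = None  # contractual restrictions

    def open_questions(self) -> List[str]:
        """List the fields that still need an answer for this domain."""
        if self.domain is Domain.INTERNAL:
            required = ("system", "owner", "expert")
        else:
            required = ("contract", "latency", "restrictions")
        return [name for name in required if not getattr(self, name)]


record = OwnershipRecord(
    source_name="sales_orders",  # made-up source name
    domain=Domain.INTERNAL,
    system="SAP",
)
print(record.open_questions())  # → ['owner', 'expert']
```

An empty list from `open_questions()` would indicate that every ownership question for that source has been answered.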
Find the Relevant Features of the Data and Data Source
- Is the database updated automatically or manually?
- How often will we need to retrieve the data (frequency)? Streaming, batch, or hybrid?
- How are we connecting to the data source?
- What is the schema and format of the data?
- What is the best cloud solution (draft) for this pipeline?
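The answers to these questions can be captured in a short profile per source, which also makes the expected schema checkable against incoming rows. The sketch below is illustrative only: `SourceProfile`, its fields, and the example source `price_sheet` are assumptions made for this example, not an existing API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

INGESTION_MODES = {"streaming", "batch", "hybrid"}


@dataclass
class SourceProfile:
    """Hypothetical profile describing how one data source is ingested."""
    name: str
    automated: bool    # True if the source is updated automatically
    ingestion: str     # "streaming", "batch", or "hybrid"
    connection: str    # how we connect, e.g. "file share" or "REST API"
    data_format: str   # e.g. "csv", "xlsx", "json"
    schema: Dict[str, type] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.ingestion not in INGESTION_MODES:
            raise ValueError(f"unknown ingestion mode: {self.ingestion!r}")

    def schema_errors(self, row: Dict[str, Any]) -> List[str]:
        """Compare one row against the expected schema and list mismatches."""
        errors = []
        for column, expected in self.schema.items():
            if column not in row:
                errors.append(f"missing column: {column}")
            elif not isinstance(row[column], expected):
                errors.append(f"wrong type for {column}")
        return errors


profile = SourceProfile(
    name="price_sheet",  # made-up source name
    automated=False,
    ingestion="batch",
    connection="file share",
    data_format="xlsx",
    schema={"product": str, "price": float},
)

# The price arrives as text here, which a manually maintained sheet could produce.
print(profile.schema_errors({"product": "PE", "price": "12.5"}))
# → ['wrong type for price']
```

Keeping the ingestion mode and schema next to each other in one place is a design choice of this sketch; it makes it easier to compare sources when drafting the pipeline.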