How to change a DataFrame column from String type to Double type in PySpark

99

I have a DataFrame with a column of type String. I want to change the column type to Double in PySpark.

This is what I did:

toDoublefunc = UserDefinedFunction(lambda x: x,DoubleType())
changedTypedf = joindf.withColumn("label",toDoublefunc(joindf['show']))

I just wanted to know whether this is the right way to do it, because I am getting an error while running Logistic Regression, so I wonder whether this is the cause of the problem.

— Abhishek Choudhary

168

There is no need for a UDF here. Column already provides a cast method that accepts a DataType instance:

from pyspark.sql.types import DoubleType

changedTypedf = joindf.withColumn("label", joindf["show"].cast(DoubleType()))

or a short string:

changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))

where the canonical string names (other variations may be supported as well) correspond to the simpleString value. So for atomic types:

from pyspark.sql import types

for t in ['BinaryType', 'BooleanType', 'ByteType', 'DateType',
          'DecimalType', 'DoubleType', 'FloatType', 'IntegerType',
          'LongType', 'ShortType', 'StringType', 'TimestampType']:
    print(f"{t}: {getattr(types, t)().simpleString()}")

BinaryType: binary
BooleanType: boolean
ByteType: tinyint
DateType: date
DecimalType: decimal(10,0)
DoubleType: double
FloatType: float
IntegerType: int
LongType: bigint
ShortType: smallint
StringType: string
TimestampType: timestamp

and, for example, complex types:

types.ArrayType(types.IntegerType()).simpleString()

'array<int>'

types.MapType(types.StringType(), types.IntegerType()).simpleString()

'map<string,int>'

— zero323

2

Using the col function also works:
from pyspark.sql.functions import col
changedTypedf = joindf.withColumn("label", col("show").cast(DoubleType()))

— Staza

What are the possible values of the cast() argument (the "string" syntax)?

— Wirawan Purwanto

I can't believe how terse the Spark docs are about the valid strings for data types. The closest reference I could find was this: docs.tibco.com/pub/sfire-analyst/7.7.1/doc/html/en-US/… .

— Wirawan Purwanto

1

How do I convert multiple columns at once?

— hui chen
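
For the question above about converting several columns at once, one possible approach is a single projection built with DataFrame.selectExpr instead of repeated withColumn calls. This is only a minimal sketch: build_cast_exprs and cast_columns are hypothetical helper names (not part of PySpark), and it assumes column names contain no backtick characters.

```python
def build_cast_exprs(all_columns, to_cast, type_name="double"):
    """Return one SQL expression per column, casting only those in to_cast.

    Hypothetical helper: assumes column names contain no backticks.
    """
    targets = set(to_cast)
    exprs = []
    for name in all_columns:
        quoted = f"`{name}`"
        if name in targets:
            # Cast the column and keep its original name.
            exprs.append(f"CAST({quoted} AS {type_name}) AS {quoted}")
        else:
            # Pass the column through unchanged.
            exprs.append(quoted)
    return exprs


def cast_columns(df, to_cast, type_name="double"):
    """Cast several columns of a Spark DataFrame in one selectExpr call."""
    return df.selectExpr(*build_cast_exprs(df.columns, to_cast, type_name))


# Inspect the generated expressions without needing a Spark session.
for expr in build_cast_exprs(["id", "show", "price"], ["show", "price"]):
    print(expr)
```

With the DataFrame from the question, this would be called as cast_columns(joindf, ["show"], "double"). Column order is preserved because the expressions follow df.columns.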

How do I change nullable to false?

— pitchblack408

48

Preserve the column name and avoid adding an extra column by using the same name as the input column:

changedTypedf = joindf.withColumn("show", joindf["show"].cast(DoubleType()))

— Patinho

3

Thanks, I was looking for how to keep the original column name.

— javadba

Is there a list somewhere of the short string data types that Spark recognizes?

— Alfredox

1

This solution also works splendidly in a loop, for example:
from pyspark.sql.types import IntegerType
for ftr in ftr_list:
    df = df.withColumn(ftr, df[ftr].cast(IntegerType()))

— Quetzalcoatl

10

The given answers are enough to deal with the problem, but I want to share another way, which may have been introduced in a newer version of Spark (I am not sure about that), so the given answers did not cover it.

We can refer to the column in a Spark statement with the col("column_name") function:

from pyspark.sql.functions import col
changedTypedf = joindf.withColumn("show", col("show").cast("double"))

— Serkan Kucukbay

5

PySpark version:

  from pyspark.sql.types import IntegerType

  df = <source data>  # placeholder: load your DataFrame here
  df.printSchema()

  # Change the column type, keeping the same column name
  df_new = df.withColumn("myColumn", df["myColumn"].cast(IntegerType()))
  df_new.printSchema()
  df_new.select("myColumn").show()

— Cristian

2

The solution was simple:

toDoublefunc = UserDefinedFunction(lambda x: float(x),DoubleType())
changedTypedf = joindf.withColumn("label",toDoublefunc(joindf['show']))

— Abhishek Choudhary
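
One caveat with float(x) inside a UDF: in plain Python, float(None) raises TypeError and float("abc") raises ValueError, so a null or malformed value in the column would likely make the job fail. A minimal sketch of a more defensive converter follows; to_double is a hypothetical name, and returning None for unparsable input is an assumption about the desired behavior, not something stated in the thread.

```python
def to_double(value):
    """Convert a value to float, returning None when it cannot be parsed.

    Hypothetical helper: mapping bad input to None (null in Spark) is an
    assumed policy, chosen here so that one bad row does not raise.
    """
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Quick check in plain Python, no Spark session needed.
for sample in ["1.5", " 2 ", None, "abc"]:
    print(repr(sample), to_double(sample))
```

This function could be passed in place of the lambda, as in UserDefinedFunction(to_double, DoubleType()). As the top answer points out, though, the built-in cast avoids the UDF altogether and is usually the simpler choice.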